Design and Implementation of Single Precision Pipelined Floating Point Co-Processor


Manisha Sangwan PG Student, M.Tech VLSI Design

SENSE, VIT University Chennai, India - 600048

[email protected]

A Anita Angeline Professor

SENSE, VIT University Chennai, India - 600048

Abstract—Floating point numbers are used in various applications such as medical imaging, radar, and telecommunications. This paper deals with the comparison of various arithmetic modules and the implementation of an optimized floating point ALU. A pipelined architecture is used to increase performance, and the design increases the operating frequency by a factor of 1.62. The logic is designed using Verilog HDL. Synthesis is done with Cadence Encounter after timing and logic simulation.

Keywords—CLA; clock-cycles; GDM; HDL; IEEE 754; pipelining; verilog

I. INTRODUCTION

These days computers are used in many applications such as medical imaging, radar, audio system design, signal processing, industrial control, and telecommunications. Several key factors are considered before choosing a number system: the computational capabilities required by the application, processor and system cost, accuracy, complexity, and performance. Over the years, designers have moved from fixed-point to floating-point arithmetic because of its wide dynamic range: floating point can represent numbers from very small to very large, although at the cost of reduced precision for a given word length. A trade-off therefore has to be made to obtain an optimized architecture.

The IEEE 754 standard for floating point numbers was adopted almost twenty years ago. The single precision format is a 32-bit number and the double precision format is a 64-bit number.

The storage layout consists of three components: the sign, the exponent, and the mantissa. The mantissa includes an implicit leading bit and the fractional part.

TABLE I. FLOATING POINT REPRESENTATION

                    Sign      Exponent      Fractional     Bias
Single Precision    1 [31]    8 [30-23]     23 [22-00]     127
Double Precision    1 [63]    11 [62-52]    52 [51-00]     1023

Sign Bit: It defines whether the number is positive or negative. If it is 0 the number is positive, otherwise it is negative.

Exponent: Both positive and negative exponents are represented by this field. To do this, a bias is added to the actual exponent to obtain the stored exponent [10]. For single precision the bias is 127 and for double precision it is 1023.

Mantissa: The mantissa consists of the implicit leading bit and the fractional part, and is represented in the form 1.f, where the 1 is implicit and f is the fractional part. The mantissa is also known as the significand.
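As a minimal illustration of this layout, the following Verilog fragment unpacks the three single-precision fields; the module and signal names are ours, for illustration only, and are not taken from the proposed design.

module fp_unpack (
    input  wire [31:0] fp_in,      // packed IEEE 754 single-precision word
    output wire        sign,       // bit 31
    output wire [7:0]  exponent,   // bits 30-23, biased by 127
    output wire [23:0] mantissa    // implicit leading bit + 23 fraction bits
);
    // The implicit bit is 1 for normalized numbers and 0 when the
    // stored exponent is zero (denormalized numbers).
    wire implicit_bit = |fp_in[30:23];

    assign sign     = fp_in[31];
    assign exponent = fp_in[30:23];
    assign mantissa = {implicit_bit, fp_in[22:0]};
endmodule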

II. IMPLEMENTATION

A. Adder and Subtractor Algorithm

Fig. 1. Block diagram of Floating Point Adder and Subtractor

In adders, the propagation of the carry from one adder block to the next consumes a lot of time. A Carry Look-Ahead (CLA) adder saves this propagation time by computing generate and propagate signals so that the carries of consecutive blocks are produced simultaneously. The CLA adder is therefore used for faster operation.



For each bit position i, the CLA computes:

S[i] = X[i] ⊕ Y[i] ⊕ C[i]
G[i] = X[i] · Y[i]          (generate)
P[i] = X[i] + Y[i]          (propagate)
C[i+1] = X[i]·Y[i] + X[i]·C[i] + Y[i]·C[i] = G[i] + P[i]·C[i]

Expanding the carry recurrence removes the ripple dependency, e.g.
C[i+1] = G[i] + P[i]·G[i-1] + P[i]·P[i-1]·C[i-1].
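A minimal 4-bit gate-level sketch of these equations is shown below; it uses the XOR form of the propagate signal (P[i] = X[i] ⊕ Y[i]), which also yields the sum directly, and all names are illustrative rather than taken from the implemented design.

module cla4 (
    input  wire [3:0] x, y,
    input  wire       cin,
    output wire [3:0] sum,
    output wire       cout
);
    wire [3:0] g = x & y;   // generate:  G[i] = X[i] . Y[i]
    wire [3:0] p = x ^ y;   // propagate (XOR form)
    wire [4:0] c;

    assign c[0] = cin;
    // Every carry is computed directly from G, P and cin, so no
    // carry ripples from stage to stage.
    assign c[1] = g[0] | (p[0] & c[0]);
    assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
    assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                       | (p[2] & p[1] & p[0] & c[0]);
    assign c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                       | (p[3] & p[2] & p[1] & g[0])
                       | (p[3] & p[2] & p[1] & p[0] & c[0]);

    assign sum  = p ^ c[3:0];   // S[i] = P[i] xor C[i]
    assign cout = c[4];
endmodule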

B. Multiplication Algorithm

Fig. 2. Block diagram of Floating Point Multiplier

Multiplication is an important block of the ALU. High-speed, low-power multipliers come with added complexity, so a trade-off must be made to obtain an optimized algorithm with a regular layout. Several multiplication algorithms are available, such as the Booth, modified Booth, Wallace, Baugh-Wooley, and Braun multipliers. Since the main concerns are speed and layout regularity, the modified Booth algorithm was chosen. It is a powerful algorithm for signed-number multiplication that treats positive and negative numbers uniformly.
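To illustrate the recoding step of the radix-4 modified Booth algorithm, the sketch below selects one partial product from {0, ±M, ±2M} for each overlapping 3-bit multiplier group; it is a schematic example under our own naming, not the paper's implementation.

module booth_pp_sel #(parameter W = 24) (
    input  wire [W-1:0] m,      // multiplicand magnitude
    input  wire [2:0]   grp,    // multiplier bits {y[i+1], y[i], y[i-1]}
    output reg  [W+1:0] pp      // two's-complement partial product
);
    // Two extra result bits so that +/-2M fits without overflow; the
    // sign extension of each partial product is resolved when the
    // partial products are accumulated.
    always @* begin
        case (grp)
            3'b000, 3'b111: pp = {(W+2){1'b0}};            //  0
            3'b001, 3'b010: pp = {2'b00, m};               // +M
            3'b011:         pp = {1'b0, m, 1'b0};          // +2M
            3'b100:         pp = ~{1'b0, m, 1'b0} + 1'b1;  // -2M
            3'b101, 3'b110: pp = ~{2'b00, m} + 1'b1;       // -M
        endcase
    end
endmodule

Because each 3-bit group covers two multiplier bits, the number of partial products is roughly halved compared with a simple shift-and-add multiplier.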

C. Division Algorithm

Fig. 3. Block diagram of Floating Point Division

For the division process, the Goldschmidt (GDM) algorithm is used. The algorithm requires both inputs to be normalized first. The two multiplications in each iteration are independent of each other and can therefore be executed in parallel, which reduces the latency.

The GDM algorithm for Q = A/B using k iterations is listed below (a behavioral sketch follows the listing):

• Require B ≠ 0 and |e0| < 1
• Initialize N = A, D = B, R = (1 − e0)/B
• For i = 0 to k
    N = N · R
    D = D · R
    R = 2 − D
• End for
• Q = N
• Return Q
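A behavioral, simulation-only sketch of these iterations is shown below using Verilog real arithmetic; the seed constant 48/17 − (32/17)·B is a standard linear approximation of 1/B for B normalized into [0.5, 1), and the operand values are illustrative.

module gdm_demo;
    real A, B, N, D, R, Q;
    integer i;
    initial begin
        A = 0.6; B = 0.75;                // B already normalized into [0.5, 1)
        N = A; D = B;
        R = 48.0/17.0 - (32.0/17.0) * B;  // initial estimate of 1/B
        for (i = 0; i <= 4; i = i + 1) begin
            N = N * R;                    // these two multiplications are
            D = D * R;                    // independent and can run in parallel
            R = 2.0 - D;                  // D converges to 1, so N converges to A/B
        end
        Q = N;
        $display("Q = %f, expected %f", Q, A / B);
    end
endmodule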

D. Pipelining

The speed of execution of instructions can be improved by a number of methods, such as using a faster circuit technology to build the processor, or arranging the hardware so that multiple operations can be performed at the same time [11]. With pipelining, multiple operations are performed simultaneously without changing the execution time of any single instruction. As shown in the example below, in sequential execution the third instruction is executed in the sixth clock cycle, whereas in the pipelined architecture the same instruction is executed in the fourth clock cycle, saving two clock cycles; here F is the fetch stage and E is the execute stage. As the instruction count increases, more clock cycles are saved.

Clock cycles:  1    2    3    4    5    6
I1:            F1   E1
I2:                      F2   E2
I3:                                F3   E3

Fig. 4. Sequential Execution

Clock cycles:  1    2    3    4
I1:            F1   E1
I2:                 F2   E2
I3:                      F3   E3

Fig. 5. Pipelined Execution
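As a minimal Verilog illustration of this overlap, the two-stage toy pipeline below accepts a new operand every cycle while the previous one is still in the second stage; the stages and the operation are placeholders, not the co-processor's actual pipeline stages.

module pipe2 (
    input  wire        clk,
    input  wire [31:0] op_in,
    output reg  [31:0] result
);
    reg [31:0] stage1_q;                 // pipeline register between stages

    always @(posedge clk) begin
        stage1_q <= op_in;               // stage 1: capture operand i+1 ...
        result   <= stage1_q + 32'd1;    // stage 2: ... while finishing operand i
    end
endmodule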

III. FUNCTIONAL AND TIMING VERIFICATION

The functional verification is done using both Cadence and Xilinx tools, and the arithmetic results are verified theoretically. Timing, power, and area analyses are likewise done using both Cadence and Xilinx.


A. Adder and Subtractor

In addition and subtraction, the sign, exponent, and fraction bits are operated on separately: the fractions are shifted to equalize the exponents, and the addition is then performed on the fraction bits. The final result is assembled into the 32-bit output [Fig. 6].
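A simplified sketch of this alignment step is given below for the same-sign (addition) path; rounding, normalization, and special cases are omitted, and all names are illustrative.

module fp_align_add (
    input  wire [7:0]  e1, e2,    // biased exponents
    input  wire [23:0] m1, m2,    // 1.f mantissas (implicit bit included)
    output wire [7:0]  e_res,     // exponent of the aligned result
    output wire [24:0] m_sum      // one extra bit for the carry-out
);
    // Shift the fraction of the smaller operand right by the
    // exponent difference, then add the aligned fractions.
    wire        swap    = (e2 > e1);
    wire [7:0]  shift   = swap ? (e2 - e1) : (e1 - e2);
    wire [23:0] m_big   = swap ? m2 : m1;
    wire [23:0] m_small = swap ? m1 : m2;

    assign e_res = swap ? e2 : e1;
    assign m_sum = m_big + (m_small >> shift);
endmodule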

Fig. 6. Simulation Waveform for Adder and Subtractor

B. Multiplier

In the multiplication block, the exponents are added and the fraction bits are multiplied according to the algorithm. To get the sign of the result, an XOR operation is performed on the two input sign bits. Finally, all the bits are combined to form the final result [Fig. 7].
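The corresponding datapath can be sketched in a few lines of Verilog; normalization and rounding of the 48-bit product are left out, and the module is an illustrative reconstruction rather than the synthesized design.

module fp_mul_core (
    input  wire        s1, s2,    // input sign bits
    input  wire [7:0]  e1, e2,    // biased exponents
    input  wire [23:0] m1, m2,    // 1.f mantissas
    output wire        s_out,
    output wire [8:0]  e_out,     // one extra bit to detect overflow
    output wire [47:0] m_out      // full product, normalized/rounded later
);
    assign s_out = s1 ^ s2;            // sign: XOR of the input signs
    assign e_out = e1 + e2 - 9'd127;   // add exponents, subtract one bias
    assign m_out = m1 * m2;            // multiply the 1.f fractions
endmodule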

Fig. 7. Simulation Waveform for Multiplier

C. Division

The exponents are subtracted, and the fraction bits are multiplied and subtracted according to the GDM algorithm, with successive iterations performed to converge on the result; the sign bit is the XOR of the input sign bits. This block consumes the most time and area [Fig. 8].

Fig. 8. Simulation Waveform for Division

D. ALU Layout

The final layout of the circuit is shown in Fig. 9.

Fig. 9. ALU Layout

IV. SYNTHESIS RESULT

Synthesis results are shown in Table II below.

TABLE II. COMPARATIVE ANALYSIS OF BOTH EXISTING AND PROPOSED DESIGN

                     Existing            Proposed
Leakage power        2.880282 µW         3.50267 µW
Dynamic power        11.377751 mW        16.14882 mW
Total power          11.380632 mW        16.15232 mW
Gate count           2881                3712
Frequency            225.65 MHz          367.654 MHz
Critical path        4.43164 ns          2.70 ns
Logic utilization    1% (466/38000)      4% (1780/46560)
IOs                  44% (130/296)       65% (157/240)
Area                 75436               97194

V. CONCLUSION

In this paper, various arithmetic modules are implemented and compared. These individual blocks are then combined into a pipelined floating point ALU in order to minimize power while increasing the operating frequency. The comparative analyses are done on both Cadence and Xilinx, and the simulation results are verified theoretically. Verilog HDL (Hardware Description Language) is used to design the whole ALU block. The total power of the existing design, 11.380632 mW, is about 0.70 times that of the proposed design, but the proposed design's operating frequency is 1.62 times that of the existing design. The gate count and area also increase because of the number of iterations used in the algorithm.


VI. FUTURE WORK

Optimization of the source code to decrease the area and gate count will improve reliability. Low power techniques could be incorporated to obtain a better trade-off.

REFERENCES

[1] Addanki Purna Ramesh, Ch. Pradeep, “FPGA Based Implementation of Double Precision Floating Point Adder/Subtractor Using VERILOG”, International Journal of Emerging Technology and Advanced Engineering, Volume 2, Issue 7, July 2012.

[2] Semih Aslan, Erdal Oruklu and Jafar Saniie, “A High Level Synthesis and Verification Tool for Fixed to Floating Point Conversion”, 55th IEEE International Midwest Symposium on Circuits and Systems (MWSCAS 2012), 2012.

[3] Prashanth B.U.V, P. Anil Kumai, G. Sreenivasulu, “Design and Implementation of Floating Point ALU on a FPGA Processor”, International Conference on Computing, Electronics and Electrical Technologies (ICCEET 2012), 2012

[4] Subhajit Banerjee Purnapatra, Siddharth Kumar, Subrata Bhattacharya, “Implementation of Floating Point Operations on Fixed Point Processor – An Optimization Algorithm and Comparative Analysis”, IEEE 10th International Conference on Computer Information Technology (CIT 2010), 2010

[5] Ghassem Jaberipur, Behrooz Parhami, and Saeid Gorgin, “Redundant-Digit Floating-Point Addition Scheme Based on a Stored Rounding Value”, IEEE Transactions on Computers, vol. 59, no.

[6] Alexandre F. Tenca, “Multi-operand Floating-point Addition”, 19th IEEE International Symposium on Computer Arithmetic, 2009.

[7] Cornea, “IEEE 754-2008 Decimal Floating-Point for Intel® Architecture Processors”, 19th IEEE International Symposium on Computer Arithmetic, 2009.

[8] Joy Alinda P. Reyes, Louis P. Alarcon, and Luis Alarilla, “A Study of Floating-Point Architectures for Pipelined RISC Processors”, IEEE International Symposium on Circuits and Systems, 2006.

[9] Peter-Michael Seidel, “High-Radix Implementation of IEEE Floating-Point Addition”, Proceedings of the 17th IEEE Symposium on Computer Arithmetic, 2005.

[10] Guillermo Marcus, Patricia Hinojosa, Alfonso Avila and Juan Nolazco-Flores, “A Fully Synthesizable Single-Precision, Floating-Point Adder/Subtractor and Multiplier in VHDL for General and Educational Use”, Proceedings of the 5th IEEE International Caracas Conference on Devices, Circuits and Systems, Dominican Republic, Nov. 3-5, 2004.

[11] Carl Hamacher, Zvonko Vranesic, Safwat Zaky, “Computer Organization”, 5th Edition, Tata McGraw-Hill Education, 2011.
