Mux Implementation of Bec-1 Based Pipelined Vedic Mac Using Han Carlson Accumulator


Chapter-1

INTRODUCTION

1.1 Introduction

The datapath is the core of every microprocessor, digital signal processor (DSP) and application-specific integrated circuit (ASIC). The heart of this datapath in turn comprises arithmetic units that perform computation-intensive arithmetic functions, such as adders, multipliers and multiply-and-accumulate units. High-performance DSPs make extensive use of the Multiply and Accumulate (MAC) unit. The MAC is the most crucial element because it lies in the critical path of the datapath circuit. The main motivation behind this thesis is therefore to increase the speed of the MAC and, consequently, the speed of the DSP. A MAC performs two important functions: multiplication and accumulation.
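As a minimal behavioural sketch of these two functions, the Python model below multiplies two unsigned 32-bit operands and adds the product into a 64-bit accumulator. The names `mac_step` and `dot_product` and the wrap-around behaviour on overflow are assumptions made for illustration; they are not taken from the hardware design of this thesis.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def mac_step(acc, a, b):
    """One MAC operation: acc <- (acc + a*b) mod 2**64 for unsigned 32-bit a, b."""
    product = (a & MASK32) * (b & MASK32)   # multiplication stage (fits in 64 bits)
    return (acc + product) & MASK64         # accumulation stage, wraps on overflow

def dot_product(xs, ys):
    """Inner product computed as a chain of MAC operations."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac_step(acc, x, y)
    return acc

print(dot_product([1, 2, 3], [4, 5, 6]))    # prints 32
```

An inner product is used as the example because it is the typical DSP workload in which every sample costs exactly one MAC operation.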

In DSP applications, multiplication is the most critical operation because the switching activity and the critical-path computation of a multiplier are high compared to other datapath units of a processing architecture. Hence, for all multiplication algorithms implemented in DSPs, latency and throughput are the two major concerns from the delay perspective. Latency is the real delay of a computing function: the time from when the inputs to a device become stable until the final result is available at the output. Throughput is the number of multiplications performed in a given period of time. Real-time signal processing requires high-speed, high-throughput multiplier units that consume low power, which is key to achieving high overall DSP performance. The development of high-throughput multipliers has therefore been a subject of interest for decades.

The most common multiplication algorithms implemented in digital systems include array multiplication, Wallace tree multiplication and Booth encoding. The array multiplier is fast because the partial products are generated in parallel. Its delay is the time taken by the signals to propagate through the gates forming the multiplication array. Its main disadvantage is that the worst-case delay increases with the width of the array, so it is limited to small operand sizes.

The Wallace tree multiplier sums the partial products using carry-save adders and consequently reduces delay, but it has a complex layout compared to the array multiplier because it uses a number of irregular wires. The Booth-encoded multiplier offers an advantage over the array multiplier because it reduces the number of partial products generated, but the increased speed comes at the cost of increased circuit complexity.

First, some ancient and basic multiplication algorithms are discussed to explore computer arithmetic from a different point of view. Then, some well-known Vedic mathematics algorithms, known for yielding quicker results, are discussed. In general, an N x N bit multiplication needs N^2 single-bit multiplications. The Karatsuba algorithm brings this complexity down to about N^1.58. The Urdhvatiryakbhyam sutra, or “vertically and crosswise” algorithm, is then discussed; it reduces the time complexity by breaking the multiplication of two N-bit numbers into multiplications of N/2-bit numbers, and the process continues until 2*2 multiplications are reached. This sutra deals efficiently with large numbers. This work presents a systematic design methodology that uses the Urdhvatiryakbhyam sutra to develop a high-throughput, fast and area-efficient multiplexer-based Vedic multiplier.

To achieve the targeted speed, the proposed MAC is realized using a multiplier based on the Urdhvatiryakbhyam sutra. The accumulator is also a computation-intensive arithmetic function (CIAF), since it involves large operand additions. The work therefore covers the implementation of various conventional adders and parallel prefix adders (PPAs). Study and analysis of these adders show that once the operand size reaches 32 bits or more, conventional adders such as ripple carry adders and carry look-ahead adders suffer from reduced speed, fan-in limitations and area complexity. For wider adders, the Han-Carlson parallel prefix adder is a promising choice in terms of speed, so the proposed MAC has been implemented using a Han-Carlson PPA.

The work also presents the logic optimization of the adder. Because the adder is a key element of the Vedic multiplier, particular stress is laid on its optimization. Full adders with 10 different factorized expressions for carry and sum have been implemented in the thesis. The study shows that multiplexer-based implementation reduces the slice utilization of the adder and makes the design area efficient.

Further, pipelining has been introduced to increase the throughput of the multiplier, reduce the power dissipation in the device and ultimately improve the performance of the device. Pipelining reduces the effective critical path by introducing pipeline latches along the datapath and thus reduces the average time per instruction.
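The effect of pipeline latches on timing can be sketched with a simple model in which the clock period is set by the slowest stage plus a register overhead. The function name `pipeline_timing` and all delay values below are hypothetical and serve only to illustrate the trade-off; they are not measurements from this work.

```python
def pipeline_timing(stage_delays_ns, register_overhead_ns=0.0):
    """Return (clock_period, latency, throughput) for a pipelined datapath.

    stage_delays_ns: combinational delay of each stage, in ns.
    A single-element list models the unpipelined circuit.
    """
    period = max(stage_delays_ns) + register_overhead_ns  # slowest stage sets the clock
    latency = period * len(stage_delays_ns)               # time to produce one result
    throughput = 1.0 / period                             # results per ns once the pipe is full
    return period, latency, throughput

# Hypothetical numbers: a 40 ns multiplier split into 4 stages of 10 ns each,
# with 1 ns of register overhead per stage.
flat = pipeline_timing([40.0])
piped = pipeline_timing([10.0, 10.0, 10.0, 10.0], register_overhead_ns=1.0)
print(flat)
print(piped)
```

In this model the pipelined version has a slightly longer latency for a single result (44 ns instead of 40 ns) but delivers a new result every 11 ns instead of every 40 ns, which is the sense in which the average time per result is reduced.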

1.2 Problem Statement

The main objective of this dissertation is to implement a high-throughput 32*32 bit Vedic multiplier with reduced area, reduced delay and reduced power consumption.

1.3 Objective of Dissertation

I. Qualitative evaluation of four different existing 32*32 Urdhvatiryakbhyam multipliers.

II. Implementation of a novel Urdhva multiplier architecture using BEC-1.

III. Realization of a multi-operand carry save adder in terms of multiplexers using a logic-optimized full adder circuit.

IV. Comparison of area, delay and power results of the proposed multiplier architecture with various existing Urdhva multiplier architectures.

V. Pipelining of the proposed 32*32 bit Vedic multiplier to increase the throughput.

VI. Implementation of a 64-bit Han-Carlson adder in the accumulator unit and its performance evaluation.

1.4 Organization of Dissertation

This Dissertation is organized as follows:

Chapter-1: Gives a brief introduction to the use of the MAC and the importance of Vedic multipliers in enhancing the overall performance of a DSP.

Chapter-2: Reviews the literature for the presented work. It examines the background of related research, drawing on IEEE papers and other refereed journals, and relates the present work to recent research going on worldwide to confirm the consistency of the work performed.

Chapter-3: Discusses ancient multiplication algorithms such as Urdhvatiryakbhyam and Nikhilam. It also throws light on the performance of the Vedic multiplier in terms of speed, area and power. The chapter also briefly explains logic optimization of digital circuits. Lastly, it discusses pipelining as a technique to improve the throughput of the system.

Chapter-4: Analyzes four existing architectures of high-speed Urdhva multipliers and gives details of each of them.

Chapter-5: Describes the proposed work on the MUX-based implementation of a high-throughput MAC using a Han-Carlson adder, and explains the design and basic building blocks of the proposed 32x32 bit Vedic MAC. The chapter also deals with the logic optimization and multiplexer-based implementation of the proposed module. Synthesis results of all the modules incorporated in the proposed design are shown for comparison with the existing Urdhva architectures. Lastly, it throws light on the pipelining approach used in the MAC to accelerate its performance.

Chapter-6: Deals with the simulation results obtained after the successful implementation of the proposed work in VHDL. The chapter also compares the performance of the proposed Vedic multiplier with the existing architectures at different operand sizes in terms of area, power, delay and space complexity. It also compares various adders and shows the significance of the Han-Carlson adder in the proposed MAC. Finally, it compares the performance of the proposed multiplier with and without pipeline registers.

Chapter-7: Gives the conclusion and future scope of the dissertation.

1.5 Tools Used:

Software: Xilinx ISE 12.4 (Integrated Synthesis Environment) has been used for synthesis and verification. ISim M.81d has been used for simulation.

Hardware: Xilinx Spartan-3 family FPGA, device XC3S400, speed grade -5, package PQ208.


Chapter-2

LITERATURE REVIEW

A study of various multipliers shows that multiplication is a fundamental function in arithmetic operations. Multiplication-based operations such as Multiply and Accumulate (MAC) and inner product are among the frequently used Computation-Intensive Arithmetic Functions (CIAF) currently implemented in many Digital Signal Processing (DSP) applications, such as convolution, Fast Fourier Transform (FFT) and filtering, and in the arithmetic and logic unit of microprocessors [1].

Since multiplication dominates the execution time of most DSP algorithms, there is a need for high-speed multipliers. Multiplication time is still the dominant factor in determining the instruction cycle time of a DSP chip. The demand for high-speed processing has been increasing as a result of expanding computer and signal processing applications. Higher-throughput arithmetic operations are important to achieve the desired performance in many real-time signal and image processing applications [2]. One of the key arithmetic operations in such applications is multiplication, and the development of fast multiplier circuits has been a subject of interest for decades. Reducing time delay and power consumption are essential requirements for many applications [2, 3]. Digital multipliers are core components of all digital signal processors (DSPs), and the speed of a DSP is largely determined by the speed of its multipliers.

Various methods exist to reduce the computation time of the multiplier, with other factors as trade-offs. Multipliers can be grouped into three types [4]:

(a) Shift-and-add multipliers, which generate partial products sequentially and accumulate them. This type requires more hardware and is the slowest. It is essentially the classical multiplying technique used by the array multiplier, which spends more time on two subtasks, addition and shifting of the bits, and hence consumes 2 to 8 clock cycles.

(b) Parallel multipliers, which generate all the partial product bits in parallel and accumulate them using a multi-operand adder, using the techniques of Wallace tree and Booth algorithm.

(c) Multipliers that use arrays of almost identical cells for the generation of bit products and accumulation.

The most common multiplication algorithms used in digital hardware are the array multiplication algorithm and the Booth multiplication algorithm. The computation time of the array multiplier is comparatively low because the partial products are calculated independently in parallel. Its delay is the time taken by the signals to propagate through the gates that form the multiplication array. Booth multiplication is another important multiplication algorithm. Large Booth arrays are required for high-speed multiplication and exponential operations, which in turn require large partial sum and partial carry registers. Multiplication of two n-bit operands using a radix-4 Booth recoding multiplier requires approximately n / (2m) clock cycles to generate the least significant half of the final product, where m is the number of Booth recoder adder stages. Thus, a large propagation delay is associated with this case [5]. The Wallace tree multiplier is considered faster than a simple array multiplier. It is a parallel multiplier that uses the carry-save addition algorithm to reduce latency, but it has a complex layout compared to the array multiplier because it uses a number of irregular wires [6].
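The carry-save addition mentioned above can be illustrated with a short bit-level model. The sketch below is an assumption-laden simplification: it applies 3:2 compression to the partial products one group after another, whereas a real Wallace tree applies the compressors in parallel layers. The function names are hypothetical.

```python
def carry_save_add(a, b, c):
    """3:2 compression: reduce three operands to a sum word and a carry word.

    Each bit position is an independent full adder, so no carry ripples
    across positions. The invariant is a + b + c == s + cy.
    """
    s = a ^ b ^ c                                # per-bit sum
    cy = ((a & b) | (a & c) | (b & c)) << 1      # per-bit carry, moved up one position
    return s, cy

def add_partial_products(rows):
    """Reduce a list of partial products with carry-save steps, then one final add."""
    rows = list(rows)
    while len(rows) > 2:
        s, cy = carry_save_add(rows[0], rows[1], rows[2])
        rows = rows[3:] + [s, cy]
    return sum(rows)                             # stands in for the final carry-propagate adder

def multiply(x, y, width=8):
    """Unsigned multiply: generate shifted partial products, then reduce them."""
    rows = [(x << i) if (y >> i) & 1 else 0 for i in range(width)]
    return add_partial_products(rows)
```

Only the last addition propagates carries across the full word, which is why carry-save reduction shortens the critical path compared with adding the partial products one after another.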

These multipliers involve n^2 bit multiplications for multiplying two n-bit operands, which makes them complex. Vedic mathematics provides a number of algorithms that yield quicker results than the above multipliers. Vedic Mathematics is the ancient system of mathematics that was rediscovered early last century by Sri Bharati Krishna Tirthaji (1884-1960) [7]. The Sanskrit word “Veda” means “knowledge”. He organized and classified the whole of Vedic Mathematics into 16 formulae, also called sutras. Among these techniques, the preferred method for multiplication is Urdhvatiryakbhyam; its methodology [8], hardware architecture [9] and implementation [10] have been presented in the literature.

Various architectures of Urdhvatiryakbhyam sutra based multipliers have come into existence. The conventional Vedic multiplier architecture was proposed in [11], with ripple carry adders in the Vedic multiplication unit for 4-bit binary numbers. In this approach, three 4-bit ripple carry adders are used and the combinational path delay is found to be 13.102 ns. When the results are compared with array and Booth multipliers, the execution time is reduced for the Vedic multiplier, which thus proves to be better. However, since the ripple carry adder waits for the carry from the previous stage, it restricts the speed of the multiplier. This architecture was extended up to 32 bits [12] and was found to be faster than the existing multipliers. The modified architecture [13] replaced the ripple carry adder with a carry look-ahead adder (CLA), and it was found that Vedic multipliers using CLAs are more power efficient than the conventional architecture. Carry look-ahead adders are fast because they do not wait for the carry from the previous stage, so the rippling effect is eliminated [14].
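The difference between the two adders can be seen in a small bit-level model. In the sketch below, which is illustrative only and uses hypothetical function names, the ripple carry adder computes each carry from the previous one, while the carry look-ahead adder expands every carry directly in terms of the generate (g) and propagate (p) signals and the carry-in.

```python
def ripple_carry_add(a, b, width=4, cin=0):
    """Bit-serial addition: each carry depends on the carry of the previous stage."""
    carry, result = cin, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (carry & (ai ^ bi))
    return result, carry

def carry_lookahead_add(a, b, width=4, cin=0):
    """Every carry is written as a flat function of g, p and cin only."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]      # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]    # propagate
    carries = [cin]
    for i in range(width):
        # c[i+1] = g[i] | p[i]g[i-1] | ... | p[i]...p[1]g[0] | p[i]...p[0]cin
        c, prod = g[i], p[i]
        for j in range(i - 1, -1, -1):
            c |= prod & g[j]
            prod &= p[j]
        c |= prod & cin
        carries.append(c)
    result = 0
    for i in range(width):
        result |= (p[i] ^ carries[i]) << i
    return result, carries[width]
```

In hardware the expanded carry expressions are evaluated in parallel, at the cost of gates with growing fan-in as the word length increases.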

The work in [15] presents the efficiency of the Urdhvatiryakbhyam Vedic method for multiplication, which differs in the actual process of multiplication itself. The modified architecture enables parallel generation of partial products and eliminates unwanted multiplication steps. An analysis of some commonly available adders is carried out, and the best of them is used for adding the partial products generated in the Vedic multiplication technique, to reduce the combinational delay in the critical path. The summation of partial products is also done in parallel, which in turn reduces the combinational path delay. Since adders are the main units that restrict the speed of a multiplier, fast adders have to be implemented to increase the speed. Parallel prefix adders are among the fastest adders available. The Han-Carlson adder is a blend of the Brent-Kung and Kogge-Stone adders. It uses one Brent-Kung stage at the beginning followed by Kogge-Stone stages, terminating with another Brent-Kung stage to compute the odd numbered prefixes [17]. It provides better performance compared to Kogge-Stone for smaller adders.

A high-performance Vedic multiplier using the Han-Carlson adder is proposed in [16]; the benefit of using the Han-Carlson adder is its high operational speed. Synthesis results show that performance parameters such as area and delay are reduced compared to a multiplier using the Kogge-Stone adder with a lower number of bits, which makes it more power efficient. Due to its regular and parallel structure, the proposed design can be realized on silicon as well. The proposed multiplier is useful for microprocessors and DSP processors whose performance depends on the efficiency of the multiplier. A simple digital multiplier architecture based on the Urdhvatiryakbhyam (vertically and crosswise) sutra of Vedic Mathematics is presented in [18], with an improved technique for a low-power, high-speed multiplier of two 16-bit binary numbers.

An algorithm is proposed and implemented in 16 nm CMOS technology. The designed 16x16 bit multiplier dissipates a power of 0.17 mW, and the propagation delay of the proposed architecture is 27.15 ns. These results are an improvement in power dissipation and delay over the other architectures in the literature. A further improved structure using just two carry-save adders was proposed in [19]. That paper presents a detailed study of different multipliers based on the array multiplier, constant coefficient multiplication (KCM) and multiplication based on Vedic mathematics. All multipliers are then compared in terms of LUTs (look-up tables) and path delays. The results show that the Urdhva multiplier is the fastest multiplier with the least path delay. A high-speed multiplier architecture for the multiplication of two 8-bit numbers that combines the advantages of compressor-based adders [20, 21] and the ancient Vedic mathematics methodology has also been proposed. The proposed multiplier is compared with the existing Urdhva multiplier and two other popular multipliers. From the comparison of speed and area, it is deduced that the compressor-based Vedic multiplier architecture is better than the conventional multipliers used in several complex VLSI circuits.
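The Han-Carlson prefix structure described earlier in this review (one Brent-Kung style stage, Kogge-Stone style stages on alternate positions, and a closing Brent-Kung style stage) can be modelled in software as follows. This is a minimal sketch under stated assumptions: bits are numbered from 0 at the LSB, so the closing stage fills in the even positions (the odd-numbered prefixes under 1-based numbering), the carry-in is folded into bit 0, and the function name `han_carlson_add` is hypothetical. It models the logic only and says nothing about the delay or area of the VHDL implementation.

```python
def han_carlson_add(a, b, width=16, cin=0):
    """Add two unsigned `width`-bit numbers with a Han-Carlson prefix network."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]      # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]    # bit propagate

    def combine(hi, lo):
        """Prefix operator: merge an upper (G, P) group with the adjacent lower one."""
        return (hi[0] | (hi[1] & lo[0]), hi[1] & lo[1])

    # Fold the carry-in into bit 0 so that every prefix yields a carry directly.
    gp = [(g[i], p[i]) for i in range(width)]
    gp[0] = (g[0] | (p[0] & cin), p[0])

    # Stage 1 (Brent-Kung style): each odd position absorbs its even neighbour.
    for i in range(1, width, 2):
        gp[i] = combine(gp[i], gp[i - 1])

    # Middle stages (Kogge-Stone style), applied to the odd positions only.
    d = 2
    while d < width:
        prev = gp[:]                                         # values of the previous stage
        for i in range(1, width, 2):
            if i - d >= 1:
                gp[i] = combine(prev[i], prev[i - d])
        d *= 2

    # Last stage (Brent-Kung style): each even position takes the prefix below it.
    for i in range(2, width, 2):
        gp[i] = combine(gp[i], gp[i - 1])

    carries = [cin] + [gp[i][0] for i in range(width)]       # carries[i] enters bit i
    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[width]
```

For a 16-bit adder this network has five prefix stages (one opening stage, three middle stages and one closing stage), one more than Kogge-Stone, while placing prefix cells on only half of the bit positions in the middle stages.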

Vedic mathematics thus improves the performance of the multiplier in terms of speed. The throughput of a multiplier can be increased by the use of pipeline registers [26]. Using this technique, RTL code for 4×4 Vedic multipliers with and without pipelining is compared in [22]. The area, delay and power analysis of the multiplier is performed in Cadence (rc). The delay in the pipelined architecture is reduced by 300 ps.

Adders form an almost obligatory component of every contemporary integrated circuit. The prerequisite of the adder is that it is primarily fast and secondarily efficient in terms of power consumption and chip area. Therefore, careful optimization of the adder is of the greatest importance. This optimization can be attained at two levels: circuit optimization or logic optimization. In circuit optimization the sizes of the transistors are manipulated, whereas in logic optimization the Boolean equations are rearranged (or manipulated) to optimize speed, area and power consumption. The paper [23] focuses on the optimization of the adder through technology-independent mapping. The work presents 20 different logical constructions of a 1-bit adder cell in CMOS logic, and their performance is analyzed in terms of transistor count, delay and power dissipation. These performance issues are analyzed through Tanner EDA with TSMC MOSIS 250 nm technology. From this analysis the optimized equation is chosen to construct a full adder circuit in terms of multiplexers. These logic-optimized multiplexer-based adders are incorporated in selected existing adders such as the ripple carry adder, carry look-ahead adder, carry skip adder, carry select adder, carry increment adder and carry save adder [25], and their performance is analyzed in terms of area (slices used) and maximum combinational path delay as a function of size.
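To make the idea of a full adder built from multiplexers concrete, the sketch below uses one commonly used formulation with three 2:1 multiplexers. It is an illustrative assumption: the text does not state which factorized expression was selected in [23], so this should not be read as that particular equation.

```python
def mux2(sel, d0, d1):
    """2:1 multiplexer: returns d0 when sel is 0 and d1 when sel is 1."""
    return d1 if sel else d0

def full_adder_mux(a, b, cin):
    """1-bit full adder expressed only with 2:1 multiplexers and inverted inputs."""
    x = mux2(a, b, 1 - b)        # a XOR b
    s = mux2(cin, x, 1 - x)      # sum = x XOR cin
    cout = mux2(x, a, cin)       # if a == b the carry is a, otherwise it is cin
    return s, cout

# Exhaustive check against the arithmetic definition of a full adder.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder_mux(a, b, cin)
            assert s + 2 * cout == a + b + cin
```

The carry multiplexer shows why this style maps well to multiplexer-rich fabrics: when the two operand bits are equal the carry is decided locally, and otherwise the incoming carry is simply passed through.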


Chapter-3

VEDIC ALGORITHMS AND LOGIC OPTIMIZATION

3.1 History of Vedic Mathematics

Jagadguru Shankaracharya Bharati Krishna Teerthaji Maharaja (1884-1960) constructed 16 sutras (formulae) and 16 Upa sutras (sub formulae) after extensive research in the Atharva Veda [6, 21]. Vedic mathematics has created wonders in mathematics, which is why it has been a wide area of research. Vedic mathematics (VM) deals with numerous basic as well as complex mathematical operations, as it provides simple yet powerful methods for fast calculations.

The word “Vedic” is derived from the word “veda”, which means the store-house of all knowledge. Vedic mathematics is mainly based on 16 sutras (or aphorisms) dealing with various branches of mathematics such as arithmetic, algebra and geometry [27]. These sutras, along with their brief meanings, are listed below alphabetically [21, 27].

1) (Anurupye) Shunyamanyat – If one is in ratio, the other is zero.

2) Chalana-Kalanabyham – Differences and Similarities.

3) Ekadhikina Purvena – By one more than the previous One.

4) Ekanyunena Purvena – By one less than the previous one.

5) Gunakasamuchyah – The factors of the sum is equal to the sum of the factors.

6) Gunitasamuchyah – The product of the sum is equal to the sum of the product.

7) Nikhilam Navatashcaramam Dashatah – All from 9 and last from 10.

8) Paraavartya Yojayet – Transpose and adjust.

9) Puranapuranabyham – By the completion or noncompletion.

10) Sankalana- vyavakalanabhyam – By addition and by subtraction.

11) Shesanyankena Charamena – The remainders by the last digit.


12) Shunyam Saamyasamuccaye – When the sum is the same that sum is zero.

13) Sopaantyadvayamantyam – The ultimate and twice the penultimate.

14) Urdhva-tiryakbhyam – Vertically and crosswise.

15) Vyashtisamanstih – Part and Whole.

16) Yaavadunam – Whatever the extent of its deficiency.

These methods can be directly applied to trigonometry, geometry, calculus and applied mathematics. The beauty of Vedic mathematics lies in the fact that it reduces the complex and tedious calculations of conventional mathematics to very simple ones [27]. This is a very research-intensive field, and it presents some of the most effective algorithms, which can then be applied to digital signal processing operations.

Multiplier architectures are usually classified into three categories. The first is the serial multiplier, which focuses on minimum hardware and minimum chip area. The second is the parallel multiplier, which focuses on delay reduction but has the serious drawback of larger chip area consumption. The third is the serial-parallel multiplier, which offers a good trade-off between the time-consuming serial multiplier and the area-inefficient parallel multiplier.

3.2 Vedic Mathematics Algorithms

The proposed Vedic multiplier is based on Vedic mathematics sutras (formulae). These sutras have traditionally been employed for the multiplication of two numbers in the decimal number system. In this work, the same sutras are applied to the binary number system to reduce the complexity of the proposed algorithm. A few of the Vedic multiplication algorithms are discussed below:

3.2.1 Urdhvatiryakbhyam sutra

The multiplier is based on the Urdhvatiryakbhyam (vertically and crosswise) algorithm of ancient Indian Vedic Mathematics. The Urdhvatiryakbhyam sutra is a recursive multiplication process whose name literally means “vertically and crosswise”. It is based on a unique concept through which all partial products can be generated in parallel, followed by the concurrent addition of these partial products [8]. The parallelism in the generation of partial products and their summation is obtained using Urdhvatiryakbhyam as explained in fig 2.1. The algorithm can be generalized for n x n bit numbers. Since the partial products and their sums are calculated in parallel, the multiplier is independent of the clock frequency of the processor [19]. Hence, the net advantage is a reduced need for microprocessors to operate at increasingly high clock frequencies. A higher clock frequency results in increased power dissipation and therefore higher device operating temperatures. By adopting the Vedic multiplier, microprocessor designers can overcome these problems and avoid device failures [8]. The processing power of the multiplier can easily be increased by increasing the input and output data bus widths, since it has quite a regular structure [27]. Due to its regular structure, it can easily be laid out on a silicon chip. The multiplier has the striking feature that as the number of bits increases, gate delay and area increase very slowly compared to other multipliers. Therefore it is time, space and power efficient.

3.2.1.1 Multiplication of two decimal numbers 41x 53

Figure 3.1 illustrates the Urdhvatiryakbhyam multiplication scheme for the two numbers 43 and 51. In the first step, the least significant digits of both numbers are multiplied, which generates a result digit and a carry digit. This carry is added in step 2, where the crosswise multiplication is done. In each step, the units digit acts as the result digit while the higher digits act as the carry, and the process goes on. The initial carry is taken as zero.
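The three steps described above for two-digit decimal operands can be traced with the following sketch. The function name `urdhva_two_digit` is hypothetical, and the operand pairs in the usage lines are used only as test values.

```python
def urdhva_two_digit(x, y):
    """Multiply two 2-digit decimal numbers with the vertically-and-crosswise steps."""
    x1, x0 = divmod(x, 10)                      # tens and units digit of x
    y1, y0 = divmod(y, 10)                      # tens and units digit of y
    digits = []
    carry = 0                                   # the initial carry is zero
    for column in (x0 * y0,                     # step 1: vertical (units digits)
                   x1 * y0 + x0 * y1,           # step 2: crosswise
                   x1 * y1):                    # step 3: vertical (tens digits)
        carry, digit = divmod(column + carry, 10)
        digits.append(digit)                    # units digit is the result digit
    # Whatever carry is left forms the leading digits of the product.
    result = carry
    for digit in reversed(digits):
        result = result * 10 + digit
    return result

print(urdhva_two_digit(41, 53))   # prints 2173
print(urdhva_two_digit(43, 51))   # prints 2193
```

For 41 x 53 the columns are 3, 17 and 20: the first gives digit 3, the second gives digit 7 with carry 1, and the third gives 21, so the product is 2173.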


Figure 3.1: Multiplication of two-digit decimal number using Urdhvatiryakbhyam sutra.

3.2.1.2 Urdhvatriyakbhyam multiplication for binary numbers.

We now extend this sutra to the binary number system. To illustrate the multiplication algorithm, consider the multiplication of two binary numbers X3X2X1X0 and Y3Y2Y1Y0. As the result of this multiplication would be more than 4 bits, we express it as P7P6P5P4P3P2P1P0. The line diagram for the multiplication of two 4-bit numbers is shown in Figure 3.2 [8]. For the sake of simplicity, each bit is represented by a circle. The least significant bit P0 is obtained by multiplying the least significant bits of the multiplicand and the multiplier. The process then follows the steps shown in Figure 3.2 [8]. As in the decimal case, the digits on both sides of each line are multiplied and added to the carry from the previous step. This generates one bit of the result (Pn) and a carry (say Cn). This carry is added in the next step, and the process goes on. If there is more than one line in a step, all the results are added to the previous carry.


Figure 3.2: Line diagram showing 4*4 multiplication using urdhvatriyakbhyam sutra[8]

In each step, the least significant bit acts as the result bit and the other bits act as the carry. For example, if some intermediate step gives 110, then 0 is the result bit and 11 is the carry (referred to as Cn in this text). Note that Cn may be a multi-bit number. Thus we get the following expressions:

P0 = X0Y0

C0P1 = X0Y1 + Y0X1

C1P2 = X0Y2 + X1Y1 + X2Y0 + C0

C2P3 = X3Y0 + X2Y1 + X1Y2 + Y3X0 + C1

C3P4 = X3Y1 + X2Y2 + X1Y3 + C2

C4P5 = X3Y2 + X2Y3 + C3

P7P6 = X3Y3 + C4
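The column expressions above can be checked with a direct software model. The sketch below evaluates each column sum, keeps its least significant bit as Pn and passes the remaining bits on as the (possibly multi-bit) carry Cn. The function name `urdhva_4x4` is hypothetical.

```python
def urdhva_4x4(x, y):
    """4x4 bit unsigned multiplication following the column expressions P0 ... P7P6."""
    X = [(x >> i) & 1 for i in range(4)]
    Y = [(y >> i) & 1 for i in range(4)]
    columns = [
        X[0] * Y[0],                                               # P0
        X[0] * Y[1] + X[1] * Y[0],                                 # C0P1
        X[0] * Y[2] + X[1] * Y[1] + X[2] * Y[0],                   # C1P2 (+ C0)
        X[3] * Y[0] + X[2] * Y[1] + X[1] * Y[2] + X[0] * Y[3],     # C2P3 (+ C1)
        X[3] * Y[1] + X[2] * Y[2] + X[1] * Y[3],                   # C3P4 (+ C2)
        X[3] * Y[2] + X[2] * Y[3],                                 # C4P5 (+ C3)
        X[3] * Y[3],                                               # P7P6 (+ C4)
    ]
    product, carry = 0, 0
    for n, column in enumerate(columns):
        total = column + carry
        if n < 6:
            product |= (total & 1) << n      # least significant bit is the result bit Pn
            carry = total >> 1               # remaining bits form the carry Cn
        else:
            product |= total << 6            # last step yields both P6 and P7
    return product
```

Because each column depends on the carry of the previous one, this form makes the carry chain visible that the following discussion sets out to shorten.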

In this approach, as the number of bits increases, the number of stages through which the carry has to ripple also increases, and hence the delay increases. An alternative approach can be used to implement the Urdhvatiryakbhyam algorithm, as shown in Figure 3.3. This technique uses a “divide and conquer” approach in which the 4*4 multiplication is divided into four 2*2 bit multiplications that can be performed in parallel. This parallel implementation style of the Urdhvatiryakbhyam sutra reduces the number of logic levels, and thus the delay of the multiplier. The beauty of this approach is that it divides a large bit stream (N bits) into smaller streams of bit length N/2 = n, and the recursive process continues until the multiplicands are of size 2; these are then multiplied in parallel, providing an increased operational speed.

Figure 3.3: Alternative Efficient Approach for Urdhvatiryakbhyam Binary Multiplication

Finally, the 2*2 multiplication follows the traditional “vertically and crosswise” technique shown in Figure 3.4, which is the same as the 2-digit decimal Urdhvatiryakbhyam multiplication. Let the two numbers be x1x0 and y1y0 and the final product be P3P2P1P0. Thus we get the following expressions.

Figure 3.4: 2*2 bit vedic multiplication

P0 = x0 y0 (vertically) (1)

C0P1= x0y1 + x1y0 (crosswise) (2)

P3P2 = C0 + x1y1 (vertically) (3)
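Expressions (1) to (3) and the divide-and-conquer decomposition can be combined into one small model. In the sketch below the 2*2 block follows the three expressions directly, and an N*N multiplication is split into four N/2*N/2 multiplications whose results are shifted and added. The function names are hypothetical, N is assumed to be a power of two, and the `+` operators stand in for whatever adders combine the partial results in hardware.

```python
def vedic_2x2(x, y):
    """2x2 bit multiplication following expressions (1), (2) and (3)."""
    x0, x1 = x & 1, (x >> 1) & 1
    y0, y1 = y & 1, (y >> 1) & 1
    p0 = x0 & y0                         # (1) vertical
    cross = (x0 & y1) + (x1 & y0)        # (2) crosswise
    p1, c0 = cross & 1, cross >> 1
    high = c0 + (x1 & y1)                # (3) vertical, gives P3P2
    return p0 | (p1 << 1) | (high << 2)

def vedic_multiply(x, y, n):
    """n x n bit unsigned multiplication; n is assumed to be a power of two >= 2."""
    if n == 2:
        return vedic_2x2(x, y)
    h = n // 2
    mask = (1 << h) - 1
    xl, xh = x & mask, x >> h            # lower and upper halves of x
    yl, yh = y & mask, y >> h            # lower and upper halves of y
    ll = vedic_multiply(xl, yl, h)       # the four half-size products are
    lh = vedic_multiply(xl, yh, h)       # independent of each other and can
    hl = vedic_multiply(xh, yl, h)       # be computed in parallel in hardware
    hh = vedic_multiply(xh, yh, h)
    return ll + ((lh + hl) << h) + (hh << n)
```

For example, `vedic_multiply(13, 11, 4)` splits into the 2*2 products 1*3, 1*2, 3*3 and 3*2 and returns 143.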


3.2.2 Nikhilam Sutra

The Nikhilam sutra deals effectively with large multiplications, since it first finds the complement of each large number from its nearest base and carries out the multiplication on the complements. The complexity of the computation decreases to a large extent as the size of the operands increases. Nikhilam based multiplication is illustrated with the following example. Consider the multiplication of two numbers (96 * 93), where 100 is chosen as the base since it is nearest to and greater than both numbers [8].

Figure 3.5: Nikhilam sutra multiplication illustration[8]

The right hand side (RHS) of the product is obtained by multiplying the numbers in Column 2 (the complemented form) (7*4 = 28). The left hand side (LHS) of the product is found by cross-subtracting the second number of Column 2 from the first number of Column 1, or vice versa, i.e., 96 - 7 = 89 or 93 - 4 = 89. The final result is obtained by concatenating the LHS and the RHS (Answer = 8928) [8].
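The worked example can be checked numerically. The sketch below is a behavioural illustration in Python with hypothetical names (`nikhilam_multiply`, `base`). It assumes both operands lie below the chosen base, and it uses ordinary integer arithmetic instead of digit concatenation, so an RHS that is wider than the base is carried into the LHS automatically.

```python
def nikhilam_multiply(a, b, base=100):
    """Multiply a and b using their complements from a common base."""
    comp_a = base - a        # Column 2 entry for a (4 when a = 96)
    comp_b = base - b        # Column 2 entry for b (7 when b = 93)
    rhs = comp_a * comp_b    # product of the complements
    lhs = a - comp_b         # cross subtraction, equal to b - comp_a
    return lhs * base + rhs  # LHS followed by RHS


print(nikhilam_multiply(96, 93))  # 8928
```

The identity (a + b - base) * base + (base - a) * (base - b) = a * b explains why the cross subtraction gives the same LHS in either direction.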

3.3 Performance of Vedic Multiplier

3.3.1. Power

A Vedic multiplier requires fewer gates for a given 8x8 bit multiplier, and fewer gates lead to lower power dissipation. Another factor responsible for the low power consumption is that the switching activity in a Vedic multiplier is lower than in array and Booth multipliers, which reduces the dynamic power. Hence, the overall power dissipation is reduced.

3.3.2. Speed

Vedic multipliers are faster than the existing array and Booth encoding multiplier architectures. As we move towards wider operands, i.e. from 8x8 to 16x16 bits, the delay of Vedic multiplier architectures is sharply lower than that of the alternatives. The delay for a 16x16 bit Vedic multiplier is 25 ns, whereas it is 37 ns for the Booth multiplier and 43 ns for the array multiplier [28]. This reduction in gate delay makes Vedic multipliers well suited to signal processing and other operations. A further advantage is that the gate delay of a Vedic multiplier increases only slowly as the number of bits increases. The speed improvements are attained by parallelizing the generation of partial products with their concurrent addition.

3.3.3 Area

The area needed for a Vedic multiplier is small compared to other multiplier architectures: the number of devices used in the Vedic multiplier is 259, while the Booth and array multipliers use 592 and 495 respectively, for 16 x 16 bit operands implemented on a Spartan FPGA [28]. The number of gates required for a Vedic multiplier is small compared to the Booth and array multipliers. As the number of gates reduces, the transistor count decreases substantially and fewer routing resources are used. Hence, the area occupied by the Vedic multiplier is very small.

Due to the regularity of its structure, this architecture can be easily realized on silicon and can work at high speed without increasing the clock frequency. It has the advantage that, as the number of bits increases, the area increases very slowly compared to other multiplier architectures.

3.4 Logic Optimization

The multi-operand carry save adder (CSA) is an essential component of the proposed Vedic multiplier, since the overall performance of any multiplier depends on the performance of its adder. The primary requirement of this CSA is that it should be fast; secondarily, it should be efficient in terms of power and chip area. Careful optimization of this CSA is therefore an essential task. This optimization can be achieved at two levels [31]:

1. Circuit level optimization: Circuit level optimization involves the manipulation of circuits in terms of transistor sizing, i.e. the W/L aspect ratios of the NMOS and PMOS transistors are varied.

2. Logic level optimization: The Boolean expression is rearranged and restructured to derive the logic form with the minimum number of literals, thereby reducing the chip area occupied by the design.

The work presented in this thesis deals with logic optimization only. Reduction in transistor count is a primary design criterion for modern digital circuits and affects the design complexity of a multiplier. Hence, the dominant principle in digital design still revolves around reducing the cost and hardware of the circuit. The best way to achieve this is logic optimization, which involves rearranging the Boolean expression to obtain the fewest literals [23], since the number of transistors required to implement a logic expression is directly proportional to the number of literals. In other words, a minimum-cost circuit must be implemented to reduce the area.

A literal can be defined as follows. A given product term consists of some number of variables, each of which may appear in either complemented or uncomplemented form. Each occurrence of a variable, in either form, is called a literal. Reducing the number of literals leads to a cost-minimized circuit. Here, cost represents the total gate count plus the total number of gate inputs. Typically, logic optimization is done in two phases [23]:

1. Technology Independent Phase: In this phase, the logic is optimized by applying various Boolean laws to simplify the expressions, or the expression is factored to overcome the major drawback of fan-in limitation [32]. Similarly, the complexity of the logic circuit in terms of logic gates and wiring can be reduced by decomposing the circuit into small sub-circuits, which can then be reused at several places in the circuit so that redundant gates are eliminated. This reduces the gate count and hence the chip area and, consequently, the power consumption.

2. Technology Dependent Phase: This phase takes into account the peculiarities and properties of the intended implementation architecture or target technology on which the circuit is to be implemented [23]. The technology-independent description resulting from the first phase is translated into a gate-level netlist in this phase. The technology dependent phase is less flexible than the first phase, which allows the circuit logic to be restructured and rebuilt to obtain the minimum number of nodes and literals and so reduce the area.

3.5 Pipelining Approach

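Pipelining is a prominent approach used in a wide variety of digital circuits to improve the throughput of logic modules with long critical paths. It increases the frequency of operation by shortening the combinational delay between successive registers. The clock period cannot be made shorter than a minimum value if the logic is to evaluate correctly. This minimum clock period is given by the following expression [31]:

Tmin = TC-Q + Tpd,logic + Tsu

where TC-Q is the clock-to-output delay of the register, Tpd,logic is the worst-case propagation delay of the combinational logic between registers, and Tsu is the setup time of the receiving register.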

The pipelining technique decomposes a sequential process into sub-processes, each executed by a dedicated segment that operates concurrently with all the other segments. A pipeline can also be visualized as a collection of processing modules through which binary data flows. The result of the computation in one logic module is transferred to the next logic block in the pipeline, and so on, as shown in Figure 3.6. In simple words, this technique breaks a computationally complex block into discrete blocks separated by clocked storage elements such as latches, flip-flops and registers. A common clock is applied to all registers to assure correct evaluation. Pipelining offers the following advantages:

1. High Throughput: Pipelining increases the functional throughput of the digital system manyfold, as the introduction of registers between the logic blocks shortens the maximum combinational delay of the circuit. These intermediate registers cause the computation of a single set of input data to be spread over a number of cycles. The increased speed and throughput come at the expense of latency: the gain in speed is achieved by clocking the sub-circuits faster, and delay equalization is provided by the registers.

Figure 3.6: Pipelining in a Logic Circuit

2. Resource Utilization: In a conventional (non-pipelined) design, a single set of input data has to pass through the whole series of combinational blocks, and only when its result has been computed is the next input fetched. So, even when the initial logic blocks have finished their processing, they remain idle until the entire process completes, which leads to underutilization of resources. Pipelining increases resource utilization: the first input set enters the first stage, and during the next clock cycle the output of the first stage is sent to the next stage while the first stage is filled with the next input set. This process goes on. Pipelining increases the throughput, but the area overhead increases because of the additional registers.

Hence, the work presented here uses pipelining to achieve high performance in the proposed multiply and accumulate (MAC) unit.
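To make the throughput and latency trade-off concrete, the following sketch applies the minimum clock period expression to an unpipelined block and to the same logic split into three stages. All delay values are hypothetical and given in picoseconds, and the helper names are illustrative only.

```python
def min_clock_period(t_cq, t_pd_logic, t_su):
    """Tmin = TC-Q + Tpd,logic + Tsu."""
    return t_cq + t_pd_logic + t_su


def pipeline_timing(stage_delays, t_cq, t_su):
    """Return (clock period, latency) for logic split into registered stages."""
    # The slowest stage sets the clock period of the whole pipeline.
    period = min_clock_period(t_cq, max(stage_delays), t_su)
    # One input set needs one clock cycle per stage to reach the output.
    latency = period * len(stage_delays)
    return period, latency


# Hypothetical numbers: 1000 ps of logic, registers with 50 ps TC-Q and 30 ps Tsu.
unpipelined = pipeline_timing([1000], t_cq=50, t_su=30)
pipelined = pipeline_timing([400, 350, 250], t_cq=50, t_su=30)
print(unpipelined)  # (1080, 1080)
print(pipelined)    # (480, 1440)
```

Under these assumed numbers the clock period falls from 1080 ps to 480 ps, so a new result is available more often, while the latency of a single result rises from 1080 ps to 1440 ps. The unequal stage delays also show why balancing the stages matters: the 400 ps stage alone determines the period.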

Chapter-4

EXISTING URDHVA MULTIPLIER ARCHITECTURES

Before introducing the proposed 32*32 bit Vedic multiplier, it is necessary to review the existing architectures of 32*32 bit Vedic multipliers based on the Urdhvatiryakbhyam sutra and to show how the proposed architecture differs from them. Four architectures are discussed below:

4.1 Basic 2x2 bit multiplier based on Urdhvatiryakbhyam (UT) algorithm

The Urdhvatiryakbhyam multiplication method can be extended to binary numbers as explained in the previous chapter. A simple 1-bit binary multiplication is performed by a logical AND operation between the bits. Using this and the UT method, the 2x2 multiplication of x1x0 and y1y0 is implemented using just 2 half adders, and the resultant bits are P3P2P1P0, as shown in Figure 4.1. The corresponding equations are given below:

P0 = x0 y0 (vertically) (1)

C0P1= x0y1 + x1y0 (crosswise) (2)

P3P2 = C0 + x1y1 (vertically) (3)

Figure 4.1: Block diagram of 2*2 bit vedic multiplier

Multipliers for wider binary operands can be built from these smaller multiplier units together with adder units.

4.2 Efficient 4*4 Vedic multiplication based on Urdhvatiryakbhyam algorithm

Let us assume the multiplicand A = A3A2A1A0 and the multiplier B = B3B2B1B0, and let the output be S7S6S5S4S3S2S1S0. Let's divide A and B into two parts, say “A3A2” & “A1A0” for A and “B3B2” & “B1B0” for B. Using the fundamentals of Vedic multiplication, taking two bits at a time and using the 2*2 bit multiplier block shown in Figure 4.1, we obtain the structure for 4x4 bit multiplication shown in Figure 4.2 [11].

Figure 4.2: Structure of 4*4 Multiplication[11]

Each block shown above is a 2x2 bit multiplier generating partial products in the same manner as discussed above. The first 2x2 multiplier has inputs “A1A0” and “B1B0”. The last block is a 2x2 bit multiplier with inputs “A3A2” and “B3B2”. The middle shows two 2x2 bit multipliers with inputs “A3A2” & “B1B0” and “A1A0” & “B3B2”. Once the partial products from all four 2x2 bit multipliers are generated, they are added using suitable N-bit adders to obtain the final 8-bit result, “S7S6S5S4S3S2S1S0”.

In general, the individual products are obtained by the same recursive partitioning method: large NxN bit operands are divided into bit streams of length N/2, which are further divided into bit streams of length N/4, until multiplicands of size 2x2 are reached, and the 2x2 bit multiplication method shown in Figure 4.1 is then used to obtain the final product. An NxN multiplication unit requires four N/2 bit multipliers and N-bit full adders, as shown in Figure 4.3. The speed of the multiplier ultimately depends on the speed of the adders used in the circuit.
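The recursive partitioning can be sketched as a short behavioural model. In the Python below, the function names are hypothetical, N is assumed to be a power of two no smaller than 2, and the final combination uses plain integer addition in place of the N-bit adders, so the sketch shows the data flow and not the timing of any particular adder.

```python
def mul2x2(x, y):
    """2x2 base block: AND terms combined by two half-adder style sums."""
    x0, x1, y0, y1 = x & 1, x >> 1, y & 1, y >> 1
    cross = (x0 & y1) + (x1 & y0)      # C0 P1
    high = (x1 & y1) + (cross >> 1)    # P3 P2
    return (high << 2) | ((cross & 1) << 1) | (x0 & y0)


def vedic_multiply(a, b, n):
    """Multiply two n-bit operands using four n/2-bit multipliers."""
    if n == 2:
        return mul2x2(a, b)
    h = n // 2
    mask = (1 << h) - 1
    a_lo, a_hi = a & mask, a >> h
    b_lo, b_hi = b & mask, b >> h
    q0 = vedic_multiply(a_lo, b_lo, h)   # vertical, lower halves
    q1 = vedic_multiply(a_hi, b_lo, h)   # crosswise
    q2 = vedic_multiply(a_lo, b_hi, h)   # crosswise
    q3 = vedic_multiply(a_hi, b_hi, h)   # vertical, upper halves
    # Partial products are aligned by their weight before being added.
    return q0 + ((q1 + q2) << h) + (q3 << n)


print(vedic_multiply(13, 11, 4))  # 143
```

A 32x32 bit multiplication in this model bottoms out in 256 calls to the 2x2 block, which in hardware all operate in parallel.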

Figure 4.3: Generalized NXN Bit Urdhvatiryakbhyam Multiplication Block Diagram

4.3. Existing Approaches for High Speed Urdhvatiryakbhyam Multiplier

4.3.1 Conventional 32X32 Bit Vedic Multiplier Using RCA (Architecture 1)

Pushpalata [11] proposed an architecture that uses ripple carry adders in the Vedic multiplication unit for 4X4 bit multiplication. This architecture can be extended to wider operands such as 8, 16 and 32 bits. The 4X4 multiplier is implemented using 2X2 multiplier units and ripple carry adders, as shown in Figure 4.4.

Three 4-bit ripple carry (RC) adders are required. In this proposal, the first 4-bit RC adder adds the two 4-bit operands obtained from the cross multiplication in the two middle 2x2 bit multiplier modules. The second 4-bit RC adder adds two 4-bit operands: a concatenated 4-bit operand and the 4-bit output sum of the first RC adder. Its carry “ca1” is forwarded to the third RC adder. The third 4-bit RC adder adds two 4-bit operands: a concatenated 4-bit operand (carry ca1, “0” and the two most significant output sum bits of the second RC adder, as shown in Figure 4.4) and the 4-bit output of the leftmost 2x2 multiplier module.

Three 4-bit ripple carry adders are used, and the combinational path delay is found to be 13.102 ns. The results are compared with array and Booth multipliers, and it is observed that the execution time is reduced for the Vedic multiplier, which thus proves to be better [11,21].

Figure 4.4: Block Diagram of Conventional 4-Bit Urdhva Multiplier[11]

The hardware implementation of the 32*32 bit conventional Urdhvatiryakbhyam multiplier shown in Figure 4.5 is an extension of the 4*4 multiplier shown in Figure 4.4.

The architecture is decomposed into the following 2 major blocks:

1. 16 * 16 bit conventional vedic multiplier

2. 32- bit ripple carry adder (RCA)

Figure 4.5: Conventional 32*32 Bit Urdhva Multiplier Architecture

4.3.1.1 16x16 bit Conventional Vedic multiplier

The 32-bit input bit streams of the multiplicand A(31-0) and the multiplier B(31-0) are each divided into two equal bit streams of 16-bit length. Four 16*16 bit conventional multipliers are used, as shown in Figure 4.5, to generate partial products using the Urdhvatiryakbhyam technique in a fashion similar to that of the 4*4 conventional Vedic multiplier explained earlier.

4.3.1.2 Ripple Carry Adder (RCA)

Here, three 32-bit ripple carry adders are used to add the partial products and generate the final product P(63-0). An N-bit (here N = 32) ripple carry adder consists of N-1 full adders and 1 half adder. This adder is also called a parallel adder because the full and half adders are arranged side by side, each generating a sum bit and a carry bit. The sum bit is taken as a result bit, and the carry is passed to the next adder unit as an input. RCAs are the simplest adders, with compact layout and the lowest power consumption, but they are not efficient for large operands, since the delay of an RCA increases linearly with the length of the addend and augend. The major factor limiting its speed is that each stage has to wait for the carry to ripple in from the previous stage before it can complete its operation. Hence, its speed is restricted, which eventually restricts the speed of the multiplier. The area complexity of the RCA is O(N) and its time complexity is O(N), where N is the operand size in bits.
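The structure described above, one half adder followed by N-1 full adders, can be modelled bit by bit. The Python sketch below is a behavioural illustration with hypothetical function names. The loop makes the linear carry chain explicit: stage i cannot produce its outputs until the carry from stage i-1 is known.

```python
def half_adder(a, b):
    """Return (sum, carry) for two input bits."""
    return a ^ b, a & b


def full_adder(a, b, cin):
    """Return (sum, carry) for two input bits and a carry-in."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a, b, n):
    """Add two n-bit operands; return (n-bit sum, carry-out)."""
    s, carry = half_adder(a & 1, b & 1)       # bit 0 needs no carry-in
    total = s
    for i in range(1, n):                     # N-1 full adders in a chain
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


print(ripple_carry_add(0xFFFFFFFF, 1, 32))  # (0, 1)
```

The printed case is the worst case for delay, because the carry generated at bit 0 has to travel through every one of the remaining 31 stages.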

When this 32*32 bit conventional multiplier was synthesized in Xilinx 12.4 and simulated in ISIM, the combinational delay of this approach was found to be 42.268 ns.

4.3.2 Modified 32X32 bit Vedic multiplier (Architecture 2)

The 4x4 multiplier comprises four 2x2 bit Vedic Urdhva multipliers, as explained earlier. The multiplicands are 4 bits wide and the result is 8 bits long. The inputs A and B are broken into chunks of size N/2, i.e. 2 bits. These 2-bit chunks are given as inputs to the 2x2 bit multipliers to generate 4-bit partial products. These outputs are then sent to an addition tree, as shown in Figure 4.6 [29].

A modified, Wallace-tree-like addition structure reduces the number of addition levels from 3 to 2 [29]. The 2 LSBs of q0 are taken directly as result bits, while the 2 MSBs of q0 are sent to the tree for further addition. This design is found to be more area efficient than the conventional design.

Figure 4.6: Modified 4X4 Bit Vedic Multiplier Architecture 2[29].

This architecture can be extended to 8, 16 and 32 bit multiplication. Figure 4.7 shows the modified architecture for 32*32 bit Vedic multiplication. It has four 16x16 bit multipliers and 3 adders of variable size. The 32-bit multiplicands A and B are each divided into N/2 = 16 bit chunks. The 16-bit binary input streams are sent to the 16x16 bit multipliers, where they are partitioned into 8-bit chunks and then into 4-bit chunks, until the process ends with chunks of size 2 bits, which are sent to the 2x2 multiplier blocks. Finally, 32-bit partial products are obtained as the outputs of the 16x16 bit Vedic multipliers and are sent to the Wallace-tree-like structure for further addition and final product generation.

When simulated and synthesized using Xilinx 12.4 and ISIM, this structure gave the following results for various operand sizes.

Figure 4.7: Modified 32*32 Bit Vedic Multiplier 2

4.3.3 32*32 bit Vedic Multiplier using Carry Save Adder (Architecture 3)

Yet another architecture, an 8x8 multiplier, was proposed in [19], as shown in the block diagram in Figure 4.8. It can be easily implemented using four 4x4 bit Vedic multiplier modules as discussed previously. Let's analyze an 8x8 multiplication, say A = A7A6A5A4A3A2A1A0 and B = B7B6B5B4B3B2B1B0. The output of the multiplication will be 16 bits, P = S15S14S13S12S11S10S9S8S7S6S5S4S3S2S1S0. In this figure, the 8-bit multiplicand A is decomposed into a pair of 4-bit halves AH-AL. Similarly, the multiplier B is decomposed into BH-BL. The 16-bit product can be written as:

P = AxB = (AH-AL)x(BH-BL) = AHxBH + AHxBL + ALxBH + ALxBL

where each partial product is weighted (shifted) according to the bit positions of its operands. Using the fundamentals of Vedic multiplication, taking four bits at a time and using the 4-bit multiplier block as discussed, we can perform the multiplication. The outputs of the 4X4 bit multipliers are added accordingly to obtain the final product. Thus, two adders are also required in the final stage. Here, four 4x4 bit Vedic multipliers and two carry save adders are used to implement the 8x8 Vedic multiplier [19].

Figure 4.8: 8x8 Vedic Multiplier Module Using Carry Save Adder architecture 3 [19]

The above architecture can be extended to 32x32 bit multiplication, as shown in Figure 4.9. The hardware implementation of the 32x32 bit multiplier using carry save adders is divided into 3 blocks:

1. 16x16 bit vedic multiplier

2. 2-operand carry save adder

3. 3-operand carry save adder.

Figure 4.9: 32 Bit Multiplier using CSA[19]

The outputs of the 16X16 bit multipliers are added accordingly to obtain the 64-bit final product; thus, two adders are also required in the final stage. The addition scheme is described here for the 8x8 module, in terms of its 4x4 bit multipliers. The first carry save adder adds 3 operands: the 4 MSBs of the first multiplier's output, concatenated with 4 zeroes, are added to the partial products generated by the second and third 4x4 bit multipliers. The 4 LSBs of the sum generated by this carry save adder become result bits of the product. The 4 MSBs are added to the partial product generated by the fourth multiplier by another 2-operand carry save adder (CSA). The output of this CSA gives the final result.
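A carry save adder avoids carry propagation by reducing three operands to a sum word and a carry word; a conventional carry-propagating addition is still needed at the very end. The Python sketch below is a generic 3:2 compressor model with a hypothetical function name. It is not the exact CSA netlist of [19], whose internal structure is not given in the text.

```python
def carry_save_add(x, y, z):
    """Reduce three operands to (sum word, carry word) without rippling carries."""
    sum_word = x ^ y ^ z                             # per-bit sum, no carry-in
    carry_word = ((x & y) | (y & z) | (x & z)) << 1  # per-bit majority, weight 2
    return sum_word, carry_word


s, c = carry_save_add(0b1001, 0b0110, 0b0011)
print(s, c, s + c)  # 12 6 18
```

Every bit position is computed independently, so the delay of this step does not grow with the operand width. Only the final `s + c` needs a carry-propagating adder.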

When this architecture was simulated and synthesized using Xilinx 12.4 and ISIM, the combinational delay was found to be 60.436 ns. However, the area occupied by this design was 2052 slices out of 3584, which is considerably less than Architecture 1 and Architecture 2. The memory utilization of this architecture is 187364 Kb, which increases its space complexity.

4.3.4 32*32 Bit Low Power Vedic Multiplier Using CLA (Architecture 4)

A fast and low power 16-bit multiplier architecture was proposed by R.K, R.S, S. Sarkar, and Rajesh [18], replacing the ripple carry adder with the carry look-ahead adder (CLA), as shown in Figure 4.10. Since the carry is generated in advance in this adder, the carry propagation time decreases, and thus this architecture improves the operational speed. The first step in the design of the 16×16 block is to group each 16-bit input into 8-bit halves (bytes). The lower and upper byte pairs of the two inputs form the vertical and crosswise product terms. Each pair of input bytes is handled by a separate 8×8 Vedic multiplier to produce sixteen partial product rows. These partial product rows are then added in a 16-bit carry look-ahead adder to generate the final product bits. Figure 4.10 shows the schematic of a 16×16 block designed using 8×8 blocks.

Figure 4.10: Low Power and High Speed Vedic Multipliers Using CLA architecture 4[18]

The 16x16 bit multiplier has a reduced combinational delay of 27.148 ns and consumes 0.169 watts of power [18], which is considerably less than all the existing multipliers.
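The idea of generating carries in advance can be illustrated with generate and propagate signals. The Python sketch below models a 4-bit look-ahead block in which every carry is written directly in terms of the inputs, and then chains four such blocks to cover 16 bits. The function names are hypothetical, and the simple block-to-block chaining is an assumption made for brevity; the adder organisation used in [18] may differ.

```python
def cla_4bit(a, b, c0=0):
    """4-bit carry look-ahead block; returns (4-bit sum, carry-out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]    # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]  # propagate
    # Each carry depends only on g, p and c0, not on the previous carry.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = [c0, c1, c2, c3]
    total = sum((p[i] ^ carries[i]) << i for i in range(4))
    return total, c4


def cla_16bit(a, b):
    """16-bit adder built from four 4-bit look-ahead blocks."""
    total, carry = 0, 0
    for k in range(4):
        nibble, carry = cla_4bit((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF, carry)
        total |= nibble << (4 * k)
    return total, carry


print(cla_16bit(0xFFFF, 0x0001))  # (0, 1)
```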

The above architecture can also be extended to 32x32 bit multiplication, as shown in Figure 4.11.

Figure 4.11: Low Power & High Speed 32x32 Bit Vedic Multiplier using CLA architecture 4

Chapter-5

DESIGN IMPLEMENTATION AND SYNTHESIS

5.1 Proposed 32-Bit Multiplier and Accumulator Unit (MAC)

Multiplication and accumulation are the two most tedious and computationally intensive operations in any DSP architecture. Hence, it is the MAC unit that decides the overall performance, most specifically the speed, of the system, as it lies in the critical path. The development of a high speed MAC is therefore crucial. This thesis concentrates on the 2 major bottlenecks of the MAC, i.e. the fast multiplication network and the fast accumulator. Both stages require the addition of large operands, which involves long carry propagation paths; this is the major factor determining the speed of the MAC.

Figure 5.1: Proposed MAC architecture

So, the proposed MAC shown in Figure 5.1 uses the Urdhva Vedic multiplication network, as it has been identified as the fastest algorithm for multiplication. The accumulator is based on the Han-Carlson parallel prefix adder, which has proved to be one of the fastest parallel prefix adders.

The proposed MAC has two basic blocks:

1. BEC-1 based 32X32 bit Vedic multiplier

2. Han-Carlson adder based accumulator.
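For orientation, the carry computation of a Han-Carlson adder can be sketched as follows: a Kogge-Stone style prefix network is run on the odd bit positions only, and one extra level then fills in the even positions. The Python model below is a generic textbook form, written under the assumption that the width n is a power of two and there is no carry-in. It is not the thesis's VHDL, and the function name is hypothetical.

```python
def han_carlson_add(a, b, n=32):
    """Add two n-bit operands; return (n-bit sum, carry-out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]    # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # bit propagate
    G, P = g[:], p[:]

    # First level: each odd position absorbs its even right-hand neighbour.
    for i in range(1, n, 2):
        G[i] = g[i] | (p[i] & g[i - 1])
        P[i] = p[i] & p[i - 1]

    # Kogge-Stone levels restricted to odd positions (distances 2, 4, 8, ...).
    d = 2
    while d < n:
        Gp, Pp = G[:], P[:]              # read from the previous level only
        for i in range(d + 1, n, 2):
            G[i] = Gp[i] | (Pp[i] & Gp[i - d])
            P[i] = Pp[i] & Pp[i - d]
        d *= 2

    # Last level: even positions take the carry from the odd position below.
    for i in range(2, n, 2):
        G[i] = g[i] | (p[i] & G[i - 1])

    # G[i] is now the carry out of bit i, so sum bit i is p[i] XOR G[i-1].
    total = p[0]
    for i in range(1, n):
        total |= (p[i] ^ G[i - 1]) << i
    return total, G[n - 1]


print(han_carlson_add(0xFFFFFFFF, 1))  # (0, 1)
```

Compared with a full Kogge-Stone network, this arrangement uses roughly half the prefix cells at the cost of one additional logic level. An accumulator would feed the running total back as one of the two operands on every cycle.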

5.2 Proposed Vedic Multiplier Architecture

The proposed high throughput Vedic multiplier comprises the following sub-modules:

1. Vedic multiplier

2. Multi-operand carry save adder

3. Binary to excess-1 code converter

4. Multiplexer

All these modules have been coded in VHDL.

5.2.1 Vedic multiplier

The proposed Vedic multiplier design is based on the “Urdhvatiryakbhyam sutra”, as it is the most preferred of the Vedic multiplication techniques. The name simply means “vertically or crosswise”. Its algebraic principle is based on the multiplication of polynomials. A Vedic multiplier using UT generates 2N-1 cross products of different widths which, when combined, form log2N + 1 partial products. UT offers the advantage of parallel generation of partial products and their concurrent addition.
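The 2N-1 cross products correspond to the 2N-1 bit columns of the product: column k collects every AND term whose bit indices add up to k. The Python sketch below, with a hypothetical function name, forms each column in turn and passes the carry on to the next column. The column sums are independent of one another and could be formed in parallel in hardware; only the carry hand-off is sequential in this model.

```python
def ut_multiply(x, y, n):
    """Column-wise vertically-and-crosswise multiplication of two n-bit numbers."""
    product = 0
    carry = 0
    for k in range(2 * n - 1):                        # 2N-1 cross products
        column = carry
        for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
            column += ((x >> i) & 1) & ((y >> (k - i)) & 1)
        product |= (column & 1) << k                  # product bit Pk
        carry = column >> 1                           # carry into column k+1
    product |= carry << (2 * n - 1)                   # most significant bit
    return product


print(ut_multiply(15, 15, 4))  # 225
```

The widest column sits in the middle (k = N-1) and contains N terms, which is why the cross products have different widths.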

This property of generating partial products in parallel offers several advantages. Firstly, it makes the multiplier independent of the clock frequency of the processor, which leads to lower power dissipation; hence, the design becomes energy efficient. Secondly, it can be implemented with a reduced number of gates, full adders and half adders. Also, due to the compact and regular architecture of this multiplier, it can be easily laid out on a silicon chip. Thirdly, since the partial products are obtained by vertical and crosswise operations, the delay is equal to the delay of the adders; in other words, the critical path consists of the adders that add the maximum number of bits in the cross products.

The multiplier has the unique advantage that, as the number of bits increases, the proportionate increase in its area and gate delay is quite low in comparison to other multipliers. Therefore, it is time, area and power efficient. The Vedic multiplier also differs strikingly in its method of computation: it handles large (NXN bit) operands by dividing them into smaller numbers of size N/2 = n, which are further divided into smaller numbers of size n/2, and so on, until the multiplicands are of size 2x2 each, for parallel generation of partial products. Thus, it performs multiplication with a minimum number of steps, which in turn reduces the delay and hardware requirements.

So, 2x2 bit Vedic multipliers form the fundamental building blocks. Using these, 4x4 multipliers are made; using those, 8x8 bit multipliers, followed by 16x16 bit multipliers and finally 32x32 bit multipliers. The device selected for synthesis is Spartan 3, XC3S400, package PQ208, speed grade -5.

So, let's start with the synthesis of the 2x2 bit Vedic multiplier.

5.2.1.1 2x2 Bit Vedic multiplier

A simple 1-bit binary multiplication is performed by a logical AND operation between the bits. Using this and the UT method, the 2X2 multiplication of x1x0 and y1y0 is implemented using just 2 half adders, and the resultant bits are P3P2P1P0, as shown in Figure 5.2. The corresponding equations are given below:

P0 = x0 y0 (vertically) (1)

C0P1= x0y1 + x1y0 (crosswise) (2)

P3P2 = C0 + x1y1 (vertically) (3)

In the 2x2 multiplier, each multiplicand lies in the range (00-11) and the output lies in the set (0000, 0001, 0010, 0011, 0100, 0110, 1001). As an illustration, let the multiplicands a and b be 11 and 10. In the first step, the LSBs of both are multiplied vertically. In the second step, crosswise multiplication and addition of the partial products take place. The third step carries out the vertical multiplication of the MSBs of both multiplicands. This gives P0 = 0, C0P1 = 01 and P3P2 = 01, i.e. the product 0110 (3 x 2 = 6).

Figure 5.2(a)

Figure 5.2(b)

Figure 5.2 (a) and (b): Illustration and Hardware implementation of 2x2 UT multiplier

5.2.1.2 Proposed 4x4 Bit Vedic Multiplier

The proposed 4x4 multiplier consists of four 2x2 Urdhvatiryakbhyam multiplier blocks. Here, the multiplicands A and B are 4 bits each and the resultant product is 8 bits. The input bit streams A and B are broken into smaller chunks of size n/2 = 2. These chunks are given as inputs to the 2x2 multiplier blocks, which generate 4-bit results. The outputs of these 2x2 multiplier blocks are then sent to a 4-bit multi-operand carry save adder (CSA), as shown in Figure 5.3, for the summation of the partial products. The sum bits of the CSA become result bits, whereas the carry bit of the CSA acts as the selection line of the multiplexer. If carry = '1', the output of the binary to excess-1 converter (BEC-1) is taken as the MSBs of the resultant product, whereas if carry = '0', the 2 MSBs of the result generated by the vertical multiplication of the MSBs of A and B form the result bits.

Figure 5.3: Proposed 4x4 Bit Vedic Multiplier

The addition is illustrated in Figure 5.4, where q0, q1, q2, q3 are the partial products generated by the 2x2 multiplier blocks and p(7-0) is the sum.
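One way to read the 4x4 description is sketched below in Python. Since Figures 5.3 and 5.4 are not reproduced here, the operand alignment is an assumption: q0 is taken as the product of the lower halves, q1 and q2 as the crosswise products, and q3 as the product of the upper halves, with the lower 2 bits of q3 entering the 4-bit addition and the upper 2 bits passing through the BEC-1 and multiplexer. The CSA is modelled with plain integer addition, and all function names are hypothetical.

```python
def mul2x2(x, y):
    """2x2 base block built from AND terms and two half-adder style sums."""
    x0, x1, y0, y1 = x & 1, x >> 1, y & 1, y >> 1
    cross = (x0 & y1) + (x1 & y0)
    high = (x1 & y1) + (cross >> 1)
    return (high << 2) | ((cross & 1) << 1) | (x0 & y0)


def bec1(value, width):
    """Binary to excess-1 conversion: value + 1, kept to `width` bits."""
    return (value + 1) & ((1 << width) - 1)


def proposed_4x4(a, b):
    """Assumed data flow: four 2x2 blocks, 4-bit addition, BEC-1 and a mux."""
    a_lo, a_hi = a & 3, a >> 2
    b_lo, b_hi = b & 3, b >> 2
    q0, q1 = mul2x2(a_lo, b_lo), mul2x2(a_hi, b_lo)
    q2, q3 = mul2x2(a_lo, b_hi), mul2x2(a_hi, b_hi)

    # Multi-operand addition of the middle columns (stands in for the CSA).
    middle = (q0 >> 2) + q1 + q2 + ((q3 & 3) << 2)
    sum_bits, carry = middle & 0xF, middle >> 4

    # Multiplexer: the carry selects the incremented or unchanged MSBs of q3.
    top = q3 >> 2
    msbs = bec1(top, 2) if carry else top
    return (msbs << 6) | (sum_bits << 2) | (q0 & 3)


print(proposed_4x4(15, 15))  # 225
```

Under this reading, a single carry bit is sufficient for every pair of 4-bit operands, which an exhaustive comparison against ordinary multiplication confirms. Whether one carry bit also suffices in the wider blocks depends on how their operands are aligned, which the figures would have to confirm.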

Figure 5.4: Addition of Partial Products In 4x4 Block

5.2.1.3 Proposed 8x8 Bit Vedic Multiplier

The proposed 8x8 multiplier consists of four proposed 4x4 Urdhvatiryakbhyam multiplier blocks. Here, the multiplicands A and B are 8 bits each and the resultant product is 16 bits. The input bit streams A and B are broken into smaller chunks of size n/2 = 4. These chunks are given as inputs to the 4x4 multiplier blocks, where they are again broken into even smaller bit streams of size n/4 = 2 and fed to the 2x2 multiplier blocks. Finally, each 4x4 bit multiplier block generates an 8-bit result. The outputs of these 4x4 multiplier blocks are then sent to an 8-bit multi-operand carry save adder (CSA), as shown in Figure 5.5, for the summation of the partial products. The sum bits of the CSA become result bits, whereas the carry bit of the CSA acts as the selection line of the multiplexer. If carry = '1', the output of the binary to excess-1 converter (BEC-1) is taken as the MSBs of the resultant product, whereas if carry = '0', the 4 MSBs of the result generated by the vertical multiplication of the MSBs of A and B form the result bits.

Figure 5.5: Proposed 8x8 bit Vedic multiplier

The addition is illustrated in Figure 5.6, where q0, q1, q2, q3 are the partial products generated by the 4x4 multiplier blocks and p(15-0) is the sum.

Figure 5.6: Addition of Partial Products In 8x8 Block

5.2.1.4 Proposed 16x16 Bit Vedic Multiplier

The proposed 16x16 multiplier consists of four proposed 8x8 Urdhvatiryakbhyam multiplier blocks. Here, the multiplicands A and B are 16 bits each and the resultant product is 32 bits. The input bit streams A and B are broken into smaller chunks of size n/2 = 8. These chunks are given as inputs to the 8x8 multiplier blocks, where they are again broken into even smaller bit streams of size n/4 = 4 and fed to the 4x4 multiplier blocks, just as in the 8x8 multiplier block. These chunks are again halved to get chunks of size 2, which are fed to the 2x2 multiplier blocks. Finally, each 8x8 bit multiplier block generates a 16-bit result. The outputs of these 8x8 multiplier blocks are then sent to a 16-bit multi-operand carry save adder (CSA), as shown in Figure 5.7, for the summation of the partial products. The sum bits of the CSA become result bits, whereas the carry bit of the CSA acts as the selection line of the multiplexer. If carry = '1', the output of the binary to excess-1 converter (BEC-1) is taken as the MSBs of the resultant product, whereas if carry = '0', the 8 MSBs of the result generated by the vertical multiplication of the MSBs of A and B form the result bits.

Figure 5.7: Proposed 16x16 Bit Vedic Multiplier

The addition is illustrated in Figure 5.8, where q0, q1, q2, q3 are the partial products generated by the 8x8 multiplier blocks and p(31-0) is the sum.

Figure 5.8: Addition of Partial Products In 16x16 Block

5.2.1.5 Proposed 32 x 32 Bit Vedic Multiplier

The proposed 32x32 multiplier consists of four proposed 16x16 Urdhvatiryakbhyam multiplier blocks. Here, the multiplicands A and B are 32 bits each and the resultant product is 64 bits. The input bit streams A and B are broken into smaller chunks of size n/2 = 16, which are given as inputs to the 16x16 multiplier blocks. There, the chunks are broken into even smaller bit streams of size n/4 = 8 and fed to the 8x8 multiply blocks, just as in the case of the 16x16 multiply block. These chunks are divided further to get bit streams of size 4, which are given as inputs to the 4x4 blocks, and again divided in half to get chunks of size 2, which are fed to the 2x2 multiply blocks. Finally, each 16x16 multiplier block generates a 32-bit result. The outputs of the 16x16 multiplier blocks are then sent to a 32-bit multi-operand carry save adder (CSA), as shown in figure 5.9, for the summation of the partial products. The sum bits of the CSA become the resultant bits, whereas the carry bit of the CSA acts as the selection line of the multiplexer. If carry = '1', the output of the binary to excess-1 converter (BEC-1) is taken as the MSBs of the resultant product, whereas if carry = '0', the 16 MSBs of the result generated by the vertical multiplication of the MSBs of A and B form the resultant bits.

Figure 5.9: Proposed 32x32 bit Vedic multiplier
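The hierarchical decomposition described for the 8x8, 16x16 and 32x32 blocks can be sketched in software. The Python sketch below is illustrative only and is not the thesis's VHDL: the function name `vedic_multiply`, the assignment of q0 to q3 to particular half-products, and the use of ordinary integer addition in place of the CSA, BEC-1 and multiplexer stage are all assumptions made here.

```python
def vedic_multiply(a: int, b: int, n: int) -> int:
    """Multiply two n-bit unsigned integers by recursive halving.

    n is assumed to be a power of two with n >= 2.
    """
    if n == 2:
        # Base case: 2x2 block written as vertical and crosswise bit products.
        a0, a1 = a & 1, (a >> 1) & 1
        b0, b1 = b & 1, (b >> 1) & 1
        return (a0 & b0) + (((a1 & b0) + (a0 & b1)) << 1) + ((a1 & b1) << 2)

    h = n // 2
    mask = (1 << h) - 1
    a_lo, a_hi = a & mask, a >> h
    b_lo, b_hi = b & mask, b >> h

    # Four (n/2 x n/2) sub-multipliers; the q0..q3 ordering is an assumption.
    q0 = vedic_multiply(a_lo, b_lo, h)
    q1 = vedic_multiply(a_hi, b_lo, h)
    q2 = vedic_multiply(a_lo, b_hi, h)
    q3 = vedic_multiply(a_hi, b_hi, h)

    # Plain integer adds stand in for the CSA / BEC-1 / multiplexer hardware.
    return q0 + ((q1 + q2) << h) + (q3 << (2 * h))
```

For example, `vedic_multiply(a, b, 32)` recurses through 16x16, 8x8 and 4x4 levels down to 2x2 blocks, mirroring the block hierarchy of the proposed multiplier.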


5.2.2 Conditional Binary to Excess-1 Code Converter (BEC-1)

The proposed BEC-1 Vedic multiplier architecture employs a binary to excess-1 converter as its vital element. The major purpose of using BEC-1 is to obtain lower area and improved speed of operation. This logic can be implemented for the different bit widths used in the modified design. The major advantages of the BEC logic stem from the following facts:

1. It uses fewer logic gates than the n-bit full adder (FA) structure. Compared to the conventional architecture, the 32-bit ripple carry adder in figure 4.5 is replaced by BEC-1 in the proposed design. Hence, the area occupied by the proposed design is much less. The structure of a 4-bit BEC-1 is shown in figure 5.10 [30]. The RTL view of BEC-1 is shown in figure 5.11.

Figure 5.10: BEC-1 Block Diagram[30]


Figure 5.11 RTL View of BEC-1

2. BEC-1 is used in the proposed design to generate two possible partial results in parallel. A multiplexer selects either the BEC-1 output or the direct input according to the selection line, as shown in figure 5.12 [30]. Since the results are generated in parallel, the delay of the proposed design is reduced.

Figure 5.12 BEC-1 with Multiplexer[30]


This is unlike existing architecture 3, which uses a 2-operand CSA that has to wait for the carry generated by the preceding 3-operand CSA; only once that carry is generated can it be added to produce the final result. Similarly, architecture 4 uses a half adder assembly, which again has to wait for the carry generated by the preceding CLA units before it can add the carry to obtain the final product.

Hence, the use of BEC-1 in proposed vedic multiplier fulfills both our motivations of reduced

area and reduced delay.
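The BEC-1 and multiplexer arrangement can be illustrated with a small bit-level model. This is a minimal sketch, assuming the commonly used gate-level form of a 4-bit binary to excess-1 converter (one NOT, XOR gates and AND terms); the function names `bec1_4bit` and `select_msbs` are hypothetical and the thesis's own netlist is given only in the figures.

```python
def bec1_4bit(b: int) -> int:
    """4-bit binary to excess-1 converter: returns (b + 1) mod 16."""
    b0, b1, b2, b3 = [(b >> i) & 1 for i in range(4)]
    x0 = b0 ^ 1                  # NOT b0
    x1 = b0 ^ b1
    x2 = b2 ^ (b0 & b1)
    x3 = b3 ^ (b0 & b1 & b2)
    return x0 | (x1 << 1) | (x2 << 2) | (x3 << 3)


def select_msbs(msbs: int, carry: int) -> int:
    """2:1 multiplexer: both candidates exist before the carry arrives."""
    incremented = bec1_4bit(msbs)   # computed in parallel with the adder
    return incremented if carry else msbs
```

Because `incremented` does not depend on `carry`, the carry only steers the multiplexer, which is the source of the delay saving argued above.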

5.2.3 Multi-Operand Carry Save Adder

The carry save adder (CSA) is a design aimed at fast multi-operand addition. A CSA consists of a ladder of stand-alone full adder circuits; in other words, it comprises n disjoint full adders, each of which individually computes the sum and carry bit for its corresponding 3 input bits. It has two main units, as shown in figure 5.13:

Figure 5.13 K-Bit Carry Save Adder Adding 3 K-Bit Operands

1. CSA unit: It consists of n disjoint full adders which take three n-bit binary input numbers and produce two outputs, an n-bit partial sum and an n-bit partial carry. It therefore reduces the addition of three numbers to the addition of two numbers, and is also known as a 3:2 counter for the same reason. The expression governing the carry save adder is given by the following equation:

A + B + C = Sum + 2 * Carry

2. CPA unit: CPA stands for carry propagation adder. The final sum of the carry save adder is obtained by adding the n-bit partial sum and the n-bit partial carry. Either a ripple carry adder or a carry look-ahead adder can be used as the CPA to compute the final sum.
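The two units can be mimicked with bitwise operations on Python integers. The sketch below is an illustration under stated assumptions, not the thesis's implementation: `carry_save_add` and `three_operand_add` are hypothetical names, and Python's built-in addition stands in for the CPA.

```python
def carry_save_add(a: int, b: int, c: int, n: int):
    """3:2 compression of three n-bit operands into sum and carry words."""
    mask = (1 << n) - 1
    a, b, c = a & mask, b & mask, c & mask
    partial_sum = a ^ b ^ c                        # per-bit full-adder sum
    partial_carry = (a & b) | (b & c) | (c & a)    # per-bit full-adder carry
    return partial_sum, partial_carry


def three_operand_add(a: int, b: int, c: int, n: int) -> int:
    """CSA unit followed by a carry propagate addition (the CPA unit)."""
    s, cy = carry_save_add(a, b, c, n)
    return s + (cy << 1)    # A + B + C = Sum + 2 * Carry
```

No bit position in `carry_save_add` depends on its neighbour, so all n full adders work in parallel; carry propagation happens only once, in the final addition.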

The carry save adder is used in the proposed Vedic multiplier as a compressor circuit to add the partial products. Figure 5.14 illustrates the use of the CSA in the proposed technique for the 8x8 bit multiplication 11111111 x 11111111. The 3:2 compressor in the proposed technique offers the following advantages over the adder chains present in the existing architectures.

Figure 5.14: Illustration of 8x8 proposed vedic multiplier technique.

1. Delay: Speed enhancement of a circuit requires minimal carry propagation. Unlike other adders, where adding 3 operands requires 2 carry propagation chains, the CSA generates the partial sum and carry in parallel: each column adds up without waiting for the carry from the previous stage. Hence, the time complexity of the CSA unit is O(1), with a delay equal to that of a single full adder circuit. Implementation and analysis show that the addition of 3 numbers using a CSA and a ripple carry adder is much faster than the addition using two ripple carry adders, as in conventional Vedic multiplier architecture 1, since the complexity of 2 RCAs is 2n (n is the operand size), whereas for a CSA followed by an RCA it is (n + 1). Table 5.1 shows the area and time complexity of the CSA, CLA and RCA.

TABLE 5.1
AREA AND TIME COMPLEXITIES OF VARIOUS ADDERS

| Adder | Area complexity | Time complexity |
|---|---|---|
| Ripple carry adder | O(n) | O(n) |
| Carry look-ahead adder | O(n log n) | O(log n) |
| 3-operand carry save adder | O(n) | O(log n) |

Ripple carry adders are relatively slow, with complexity O(n), as each full adder has to wait for the carry from the previous stage. A carry look-ahead adder can be used instead, so that the complexity of the multi-operand adder reduces further to (log n + 1). The synthesis results also show that the CSA in the proposed architecture reduces the combinational delay to 15.553 ns, as against the 20 ns taken by the 2 CLAs in architecture 4 discussed earlier.

2. Area: The area occupied by the CSA is given by the following equation:

$A_{CSA} = N \times A_{FA} + A_{CPA}$

The use of carry look-ahead adders in architectures 2 and 3 increases the number of slices occupied, as the area complexity of a CLA is O(n log n) whereas for an RCA it is O(n). In the proposed design there is a 37.5% reduction in the number of slices utilized as compared to architecture 4.


5.2.4 Flowchart of Proposed Vedic Multiplier Architecture For NXN Bits

The following flowchart explains the architecture of the proposed multiplier when extended to NxN bit multiplication.

Figure 5.15: Flowchart of NXN Bit Proposed Multiplier

5.3 Accumulator

The MAC multiplies 2 numbers, the multiplier and the multiplicand, and adds the product to the result stored in the accumulator to obtain the final result. The output of the register is fed back as an input to the adder, so that on each cycle the output of the multiplier is added to the contents of the accumulator. Figure 5.16 shows the accumulator unit. In this case, the accumulator has to add large operands of 64 bits, which involves long carry propagation chains. Hence, fast adders need to be implemented to improve the performance of the whole MAC unit.

Figure 5.16: Block Diagram of Accumulator
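The feedback behaviour of the accumulator can be stated compactly as a loop. The following is a behavioural sketch only: the name `mac`, the 32-bit operand masking and the wrap-around of the 64-bit accumulator on overflow are assumptions made here for illustration.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def mac(pairs, acc: int = 0) -> int:
    """Multiply-accumulate over (multiplicand, multiplier) pairs."""
    for a, b in pairs:
        product = (a & MASK32) * (b & MASK32)   # 32x32 -> 64-bit product
        acc = (acc + product) & MASK64          # register feeds back to adder
    return acc
```

Each loop iteration corresponds to one clock cycle of the accumulator: the stored value and the new product meet in the 64-bit adder, whose carry chain is the reason a fast adder is needed.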

5.3.1 Adder design

A MAC requires fast adders. Implementing the accumulator with conventional adders such as ripple carry adders (RCA) or regular carry look-ahead (CLA) adders degrades the performance of the MAC for the following reasons:

1. The RCA has the serious drawback that its delay increases with the number of bits, as each consecutive full adder has to wait for the carry from the previous stage. Since carry propagation takes a long time, an RCA used in a MAC reduces its speed.
2. The CLA is faster than the RCA, as it calculates the carry in advance by looking at the input bits, which reduces the carry propagation delay. However, the use of the CLA is restricted to smaller adder widths, because wider adders have substantial loading capacitance and therefore large delay and large power consumption.

To attain high performance, implementing the MAC with parallel prefix adders (PPA) is therefore one of the most promising solutions to the serious drawbacks mentioned above.

5.3.1.1 Parallel Prefix Adders

Parallel prefix adders are considered to be among the fastest adders and are flexible enough for VLSI implementation. They perform high speed addition by pre-computing generate and propagate signals. PPA operation has the following 3 stages [16]:

1. Pre-processing stage: This stage computes the generate and propagate signals, which are used to generate the carry input of each adder. A and B are the inputs. These signals are given by equations 1 and 2 [17]:

$P_i = A_i \oplus B_i$ (1)

$G_i = A_i \cdot B_i$ (2)

2. Carry generation network: The carry corresponding to each bit is computed in this stage, with the generate and propagate signals treated as intermediate signals. The carries are computed in parallel, governed by equations 3 and 4 [17]:

$P_{(i:k)} = P_{(i:j)} \cdot P_{(j-1:k)}$ (3)

$G_{(i:k)} = G_{(i:j)} + \left(G_{(j-1:k)} \cdot P_{(i:j)}\right)$ (4)

3. Post-processing stage: This stage computes the final sum and carry bits following equations 5 and 6 [17], where the carry uses the group signals spanning bits i down to 0:

$S_i = P_i \oplus C_i$ (5)

$C_{i+1} = \left(P_{(i:0)} \cdot C_0\right) + G_{(i:0)}$ (6)

The most commonly used PPAs are the Brent-Kung, Kogge-Stone and Han-Carlson adders. The Kogge-Stone adder is the fastest reported PPA; it generates the carry signals in O(log n) time [31], as it has minimum logic depth and minimum fan-out, but this speed comes at the cost of increased area. The Brent-Kung adder is area efficient and occupies less routing area, O(n log n), but it is slower than the Kogge-Stone adder, with T = 2 log n - 1 [31].

The proposed MAC uses the Han-Carlson adder, as it is a compromise between the Kogge-Stone and Brent-Kung adders. It uses a Brent-Kung stage at the beginning, followed by Kogge-Stone stages, and another Brent-Kung stage at the end. It generates the carry in T = log n + 1 time and occupies area O(n log n) [17]. Figure 5.17 [16] shows a 32-bit Han-Carlson adder.


Figure 5.17: 16-bit Han Carlson Adder[16]
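The three PPA stages and the Han-Carlson stage ordering can be modelled at bit level. The sketch below is an illustrative software model, assuming n is a power of two; the function name `han_carlson_add` and the list-based representation of the generate and propagate signals are choices made here, not taken from the thesis's VHDL.

```python
def han_carlson_add(a: int, b: int, n: int = 64, cin: int = 0):
    """Add two n-bit integers with a Han-Carlson prefix network.

    Returns (sum mod 2**n, carry out, number of prefix stages).
    """
    # Pre-processing: bitwise propagate and generate signals.
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]

    # First stage (Brent-Kung style): each odd bit absorbs its even neighbour.
    for i in range(1, n, 2):
        G[i], P[i] = G[i] | (P[i] & G[i - 1]), P[i] & P[i - 1]
    stages = 1

    # Middle stages (Kogge-Stone style) on odd positions, spans 2, 4, 8, ...
    d = 2
    while d < n:
        new_G, new_P = G[:], P[:]
        for i in range(d + 1, n, 2):
            new_G[i] = G[i] | (P[i] & G[i - d])
            new_P[i] = P[i] & P[i - d]
        G, P = new_G, new_P
        stages += 1
        d *= 2

    # Last stage (Brent-Kung style): even bits take the prefix of bit i - 1.
    for i in range(2, n, 2):
        G[i], P[i] = G[i] | (P[i] & G[i - 1]), P[i] & P[i - 1]
    stages += 1

    # Post-processing: sum bits from the bit propagate and the incoming carry.
    total, c = 0, cin
    for i in range(n):
        total |= (p[i] ^ c) << i
        c = G[i] | (P[i] & cin)     # carry into bit i + 1
    return total, c, stages
```

For n = 64 the model uses one initial stage, five Kogge-Stone stages and one final stage, i.e. seven stages, which agrees with the T = log n + 1 figure quoted above.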

Implementing the MAC unit with the Han-Carlson adder proves to be faster than using the hybrid carry select adder, which is frequently used nowadays for MAC implementation. This increase in speed comes with an increase in area, as shown by the analysis in chapter 6.

5.4 Logic Optimization of CSA-Multiplexer Based Approach

The work presented deals with technology independent logic optimization for the multi-operand carry save adder. Since a multi-operand CSA is a ladder of multiple stand-alone full adders, the optimization of the full adder is the major step in the logic optimization of the CSA.

5.4.1 Full-Adder Implementation Methodologies

The full adder, being the essential component of the CSA, needs careful optimization. A full adder is a circuit which adds 3 bits, the two inputs A and B and the carry from the previous stage, and produces two outputs, SUM and CARRY. The work presents 10 different logic constructs of the full adder, as shown below.

5.4.1.1 1-Bit Adder Using XOR, AND, OR (Method 1)

The general expression for the full adder is:

$SUM = A \oplus B \oplus CIN$

$CARRY = A \cdot B + B \cdot CIN + CIN \cdot A$

Figure 5.18: Full Adder by Method 1

5.4.1.2 1-Bit Adder using NAND (Method 2)

$SUM = A \oplus B \oplus CIN$

$CARRY = A \cdot B + B \cdot CIN + CIN \cdot A$

Figure 5.19: Full adder using NAND gates

5.4.1.3 1-Bit Adder Using NOR (Method 3)

$SUM = A \oplus B \oplus CIN$

$CARRY = A \cdot B + B \cdot CIN + CIN \cdot A$

Figure 5.20: 1-Bit Adder Using NOR

5.4.1.4 1-Bit Adder Using XOR, NOR, NOT, OR (Method 4)

$SUM = A \oplus B \oplus CIN$

CARRY =

Figure 5.21: 1-Bit Adder using Method 4

5.4.1.5 Full Adder Using XNOR, NOT, AND, OR (Method 5)

SUM = (A B CIN)‟

$CARRY = AB + ((A \oplus B)')' \cdot CIN$

Figure 5.22: 1-Bit Adder using Method 5

5.4.1.6 Full Adder Using XOR, AND (Method 6)

$SUM = A \oplus B \oplus CIN$

$CARRY = ((A \oplus B) \cdot CIN) \oplus AB$

Figure 5.23 1-Bit Adder using Method 6

5.4.1.7 Full Adder Using XOR, NAND, NOR (Method 7)

$SUM = A \oplus B \oplus CIN$

$CARRY = ((AB)' + (B \cdot CIN)' + (CIN \cdot A)')'$

Figure 5.24 1-Bit Adder using Method 7

5.4.1.8 Full Adder Using XOR, NAND, NOT, NOR (Method 8)

$SUM = A \oplus B \oplus CIN$

$CARRY = ((AB)' \cdot (((A + B)')' \cdot CIN)')'$

Figure 5.25 1-Bit Adder using Method 8

5.4.1.9 Full Adder Using XOR, XNOR, MUX (Method 9)


Figure 5.26: 1-bit adder using method 9

5.4.1.10 Full adder using MUX (Method 10)

(a) (b)

Figure 5.27: MUX-based full adder (a) RTL View (b) Block Diagram
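Any candidate full adder expression can be checked exhaustively against integer addition, since there are only eight input combinations. The sketch below does this for the Method 1 expressions, for an XOR/AND form of the kind used in Method 6, and for one common multiplexer-based formulation. The multiplexer form shown here is an assumption for illustration; the thesis gives its Method 10 netlist only as a figure, and all function names are hypothetical.

```python
from itertools import product


def method1(a, b, cin):
    """SUM = A xor B xor CIN, CARRY = AB + B.CIN + CIN.A."""
    return a ^ b ^ cin, (a & b) | (b & cin) | (cin & a)


def xor_and_form(a, b, cin):
    """XOR/AND-only carry: ((A xor B).CIN) xor AB."""
    return a ^ b ^ cin, ((a ^ b) & cin) ^ (a & b)


def mux(sel, d0, d1):
    """2:1 multiplexer: returns d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0


def mux_full_adder(a, b, cin):
    """Assumed mux-based form: x = A xor B steers both outputs."""
    x = a ^ b
    s = mux(cin, x, x ^ 1)      # SUM   = CIN ? not x : x
    cy = mux(x, a, cin)         # CARRY = x ? CIN : A
    return s, cy


def is_full_adder(fn) -> bool:
    """True if fn matches a + b + cin on all eight input combinations."""
    return all(
        fn(a, b, c) == ((a + b + c) & 1, (a + b + c) >> 1)
        for a, b, c in product((0, 1), repeat=3)
    )
```

The same `is_full_adder` check can be applied to any of the other logic constructs before it is instantiated inside the carry save adder.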

The full adder built with each of the above 10 methodologies has been incorporated in carry save adders of operand sizes 4, 8, 16 and 32 bits, and the performance of the multi-operand carry save adder has been analyzed in terms of cost, slice utilization and maximum combinational delay. This analysis shows that the optimized full adder expression in terms of multiplexers is the best of the 10 logic constructs. Hence, logic optimized multiplexer based carry save adders have been employed in the proposed Vedic multiplier to improve its overall performance. The RTL schematic of the multiplexer based CSA is shown in figure 5.28.

Figure 5.28: RTL View of Multi-operand Carry Save Adder

5.5 Pipelining of Proposed Vedic Multiplier

Pipelining is used in the thesis to reduce the combinational delay of the proposed 32x32 bit Vedic multiplier. The 32x32 bit multiplication is divided into 16x16 multiplications, which are further divided into 8x8 multiplications, which in turn are broken into 4x4 multiplications, and each 4x4 multiplication is divided into 2x2 multiplications. Each of these multiplications is performed by dedicated segments which operate concurrently. The 2x2, 4x4, 8x8 and 16x16 multiply blocks are separated by parallel in parallel out (PIPO) registers, as shown in figure 5.29.


1. During the first clock cycle, the data enters the 2x2 multiply blocks, which process it and store the result in the first pipe.
2. On the second clock cycle, the 4x4 multiply blocks receive their input from the first pipe, process the data and store the result in the second pipe.
3. During the third clock cycle, the 8x8 combinational blocks receive their input from the second pipe, produce the partial products and store them in the third pipe.
4. During the fourth clock cycle, data from the third pipe is fed to the 16x16 logic blocks, which produce the partial products, and the fourth pipe holds this data.
5. In the fifth clock cycle, the data enters the 32x32 multiply block and the final product is generated. The output is therefore available at the end of the fifth cycle.

Figure 5.29: Pipelining in 32x32 Bit Proposed Multiplier

The advantage of pipelining becomes apparent when the minimum clock period of the pipelined 32x32 bit multiplier is examined: it is reduced to 10.520 ns. This increased performance comes with an area overhead due to the insertion of the PIPO registers.
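The relationship between the minimum clock period, the five pipeline stages and the resulting latency and throughput can be worked out directly. The helper below is a small illustrative calculation; the name `pipeline_timing` is hypothetical, and it assumes an ideal pipeline that accepts one new operand pair per clock with no stalls.

```python
def pipeline_timing(period_ns: float, stages: int, n_ops: int):
    """Return (max frequency in MHz, latency in ns, time for n_ops in ns)."""
    f_max_mhz = 1000.0 / period_ns                   # one result per clock
    latency_ns = stages * period_ns                  # first result appears
    total_ns = (stages + n_ops - 1) * period_ns      # pipeline fill + stream
    return f_max_mhz, latency_ns, total_ns


f_max, latency, total = pipeline_timing(10.520, stages=5, n_ops=1000)
print(f"f_max = {f_max:.3f} MHz, latency = {latency:.2f} ns, "
      f"1000 products in {total:.2f} ns")
```

Under these assumptions a single product takes five clock periods to emerge, so the benefit of pipelining lies in throughput for a stream of multiplications rather than in the latency of one isolated multiplication.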

5.6 Synthesis of Proposed Vedic Multiplier

This section deals with the synthesis of the proposed Vedic multiplier modules. The FPGA used is a Xilinx Spartan 3 (family), XC3S400 (device), PQ208 (package).

The RTL view and the device utilization summary are given below for each module.

5.6.1 2X2 Multiply Block

5.6.1.1 Device utilization summary and Delay

Figure 5.30: RTL View of 2x2 multiply block

5.6.2 4x4 Multiply Block

5.6.2.1 Device Utilization Summary and Delay

Delay: 12.954ns (Levels of Logic = 7)


Figure 5.31: RTL View 4x4 multiply block

5.6.3 8x8 Multiply Block

5.6.3.1 Device Utilization Summary and Delay

Delay: 18.139ns (Levels of Logic = 11)


Figure 5.32 RTL View of 8x8 multiply Block

5.6.4 16x16 Multiply Block

5.6.4.1 Device Utilization Summary and Delay

Delay: 25.235ns (Levels of Logic = 16)


Figure 5.33 RTL view of 16x16 multiply block


5.6.5 32x32 Proposed Vedic Multiplier

Figure 5.34 RTL View of 32x32 Vedic Multiplier

5.6.5.1 Device Utilization Summary and Delay

Logic Levels :23


5.6.6 32x32 Bit Pipelined Multiplier

5.6.6.1 Device Utilization Summary and Delay


Figure 5.35: RTL View of Pipelined 32x32 bit Vedic Multiplier


5.6.7 Proposed Pipelined MAC with Han Carlson Accumulator

5.6.7.1 Device Utilization Summary and Delay

Figure 5.36: RTL View Accumulator


Figure 5.37 : RTL View of Proposed MAC with Han-Carlson Based Accumulator


Chapter-6

RESULTS AND COMPARISON

6.1 Results

Simulation of all the modules has been done using ISIM (8.1d). The dynamic power of the 32x32 bit Vedic multipliers has been analyzed using the Xilinx Power Analyzer tool (XPA). Area is analyzed in terms of slice utilization, and delay, as the sum of route delay and logic delay, has been estimated from the synthesis report generated by Xilinx 12.4. Hardware implementation has been done on a Spartan 3, XC3S400 (device), PQ208 (package).

6.1.1 Simulation of 32X32 bit Proposed Vedic Multiplier

Figure 6.1 Simulation of 32X32 bit Proposed Vedic Multiplier

Description

Ain1: 32-bit input multiplicand; Bin1: 32-bit input multiplier; Sout: 64-bit output

6.1.2 Simulation of 32X32 bit Proposed Vedic Multiplier with Pipelining

Figure 6.2(a)


Figure 6.2(b)

Figure 6.2: (a) and (b) Simulation of 32X32 bit Proposed Vedic Multiplier with Pipelining

Ain1: 32-bit input multiplicand; Bin1: 32-bit input multiplier; Sout: 64-bit output; Clk: clock of the multiplier

6.1.3 Simulation of 32x32 Proposed Multiplier and Accumulator Unit (MAC)

Figure 6.3: Simulation Output of Proposed MAC

Ain1: 32-bit input multiplicand; Bin1: 32-bit input multiplier; Sout: 64-bit output; Clk: clock of the multiplier

6.1.4 FPGA Implementation of Proposed Architecture with 8 Bit Operand Size

Hardware implementation of the proposed architecture has been done for the 8x8 module, for different input vectors. A glowing test LED represents logic state '1' and an unlit LED represents logic state '0'. Inputs have been applied through the input switches.

Input A = "00000000"
Input B = "00000000"
Output = "0000000000000000"

Figure 6.4: FPGA Implementation of Proposed 8x8 Architecture

Input A = "00000001"
Input B = "11111111"
Output = "0000000011111111"

Figure 6.5: FPGA Implementation of Proposed 8x8 Architecture


6.2 Comparison

6.2.1 Comparison of Proposed Vedic Multiplier with Different Existing Architectures

All the existing Urdhva multiplier architectures discussed in Chapter 4 have been extended up to 32x32 bits and have been synthesized and simulated using Xilinx 12.4 and ISIM (8.1d) respectively. All the modules have been coded in VHDL. All the architectures have been evaluated at operand sizes 4, 8, 16 and 32 for a wide range of parameters including area, delay, power, logic depth and memory utilization. The results obtained are compared with those of the proposed technique.

TABLE 6.1
COMPARISON OF PROPOSED 4X4 MULTIPLIER WITH EXISTING 4X4 MULTIPLIER ARCHITECTURES

| Multiplier | Total delay (ns) | Logic delay (ns) | Route delay (ns) | Logic levels | Slice utilization | 4-input LUT utilization | Memory (Kb) |
|---|---|---|---|---|---|---|---|
| Architecture 1 | 13.574 | 8.333 | 5.241 | 8 | 21 | 41 | 134772 |
| Architecture 2 | 13.079 | 8.080 | 4.999 | 7 | 23 | 43 | 134772 |
| Architecture 3 | 17.862 | 9.517 | 8.345 | 10 | 19 | 34 | 137844 |
| Architecture 4 | 13.676 | 8.498 | 5.178 | 8 | 15 | 29 | 134772 |
| Proposed architecture | 13.159 | 8.141 | 5.018 | 7 | 20 | 39 | 134772 |

TABLE 6.2
COMPARISON OF PROPOSED 8X8 MULTIPLIER WITH EXISTING 8X8 MULTIPLIER ARCHITECTURES

| Multiplier | Total delay (ns) | Logic delay (ns) | Route delay (ns) | Logic levels | Slice utilization | 4-input LUT utilization | Memory (Kb) |
|---|---|---|---|---|---|---|---|
| Architecture 1 | 21.676 | 10.893 | 10.751 | 13 | 119 | 227 | 13680 |
| Architecture 2 | 26.337 | 10.414 | 9.923 | 12 | 111 | 213 | 136820 |
| Architecture 3 | 21.608 | 11.372 | 10.236 | 14 | 104 | 204 | 137844 |
| Architecture 4 | 19.007 | 10.249 | 8.758 | 12 | 88 | 172 | 136820 |
| Proposed architecture | 18.139 | 9.996 | 8.143 | 11 | 100 | 196 | 136820 |

TABLE 6.3
COMPARISON OF PROPOSED 16X16 MULTIPLIER WITH EXISTING 16X16 MULTIPLIER ARCHITECTURES

| Multiplier | Total delay (ns) | Logic delay (ns) | Route delay (ns) | Logic levels | Slice utilization | 4-input LUT utilization | Memory (Kb) |
|---|---|---|---|---|---|---|---|
| Architecture 1 | 30.631 | 14.621 | 16.010 | 21 | 569 | 1087 | 145012 |
| Architecture 2 | 31.765 | 14.935 | 16.830 | 22 | 547 | 1065 | 145012 |
| Architecture 3 | 36.160 | 17.269 | 18.891 | 27 | 432 | 845 | 14736 |
| Architecture 4 | 28.830 | 13.602 | 14.778 | 19 | 451 | 868 | 148084 |
| Proposed architecture | 25.235 | 12.287 | 12.948 | 16 | 485 | 948 | 143988 |

TABLE 6.4
COMPARISON OF PROPOSED 32X32 MULTIPLIER WITH EXISTING 32X32 MULTIPLIER ARCHITECTURES

| Multiplier | Total delay (ns) | Logic delay (ns) | Route delay (ns) | Logic levels | Slice utilization | LUT utilization | Memory (Kb) | Power (mW) |
|---|---|---|---|---|---|---|---|---|
| Architecture 1 | 42.268 | 18.217 | 24.051 | 29 | 2217 | 4319 | 175784 | 99 |
| Architecture 2 | 46.762 | 20.954 | 25.880 | 35 | 2148 | 4168 | 179428 | 93 |
| Architecture 3 | 60.436 | 26.763 | 33.673 | 47 | 2052 | 3985 | 187364 | 86 |
| Architecture 4 | 38.855 | 17.242 | 21.613 | 26 | 2031 | 3940 | 195300 | 83 |
| Proposed design | 34.652 | 15.579 | 19.073 | 23 | 1882 | 366 | 174620 | 80 |

The following observations are drawn from tables 6.1, 6.2, 6.3 and 6.4:

1) The delay of the proposed architecture is reduced at all operand sizes, because it uses a single carry save adder for the summation of partial products. In contrast, the conventional architecture uses 2 ripple carry adders, and the modified architectures use at least 2 carry look-ahead adders, for the addition of 3 operands, so the carry has to propagate twice for the whole operation. The time complexity of the architectures with 2 RCAs is therefore 2n (n = operand size) and that of the architectures with two CLAs is 2 log n, whereas the single carry save adder of the proposed architecture reduces the time complexity to (1 + log n). Consequently, the logic depth is reduced, which reduces the logic delay of the proposed design.

2) The second reason for the reduced delay is that the architecture uses a binary to excess-1 code converter with a multiplexer, which produces two results in parallel without waiting for the carry from the previous unit. When the carry is generated by the CSA, it simply acts as a selection line to select the correct output. This contrasts with the half adder assembly in architecture 4, the ripple carry adder in architecture 1 and the 2-operand carry save adder in architecture 3, all of which require the carry from the previous unit to compute the final result. The delay comparison is shown graphically in figure 6.6.

Figure 6.6 : Graph showing the delay comparison of various architectures
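The relative delay improvement at 32x32 bits follows directly from the total delay column of table 6.4. The short calculation below is illustrative; the dictionary layout and the helper name `reduction_percent` are choices made here, and the delay values are copied from table 6.4.

```python
# Total delays (ns) of the 32x32 multipliers, taken from table 6.4.
delays_ns = {
    "Architecture 1": 42.268,
    "Architecture 2": 46.762,
    "Architecture 3": 60.436,
    "Architecture 4": 38.855,
}
proposed_ns = 34.652


def reduction_percent(reference: float, proposed: float) -> float:
    """Percentage by which `proposed` is lower than `reference`."""
    return 100.0 * (reference - proposed) / reference


for name, delay in delays_ns.items():
    print(f"{name}: {reduction_percent(delay, proposed_ns):.1f} % lower delay")
```

With these figures the largest reduction is against architecture 3, at a little under 43 %, and the smallest is against architecture 4, at a little under 11 %.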

6.2.2 Comparison of Carry Save Adder Using Various Full Adder Topologies

The multi-operand carry save adder consists of a ladder of full adders. These full adders have been implemented using the 10 methodologies discussed in Chapter 5. The performance of the CSA built from each full adder circuit has been analyzed at operand sizes 4, 8, 16 and 32 in terms of slice utilization (area) and combinational path (delay).

TABLE 6.5
DELAY COMPARISON OF CSA FOR DIFFERENT OPERAND SIZE USING DIFFERENT FULL-ADDER METHODOLOGIES

| Method | 4-bit delay (ns) | 8-bit delay (ns) | 16-bit delay (ns) | 32-bit delay (ns) |
|---|---|---|---|---|
| Method 1 | 9.083 | 12.324 | 16.537 | 19.0381 |
| Method 2 | 10.161 | 11.885 | 18.053 | 19.360 |
| Method 3 | 9.88 | 12.974 | 17.677 | 20.008 |
| Method 4 | 10.17 | 12.197 | 17.442 | 18.398 |
| Method 5 | 10.462 | 11.711 | 14.746 | 15.623 |
| Method 6 | 10.462 | 11.275 | 14.546 | 16.167 |
| Method 7 | 10.255 | 11.745 | 14.546 | 16.167 |
| Method 8 | 10.462 | 11.745 | 14.995 | 16.167 |
| Method 9 | 9.787 | 11.591 | 14.995 | 17.456 |
| Method 10 | 8.923 | 9.59 | 13.7 | 15.553 |

The synthesis results obtained by implementing the full adder in all 10 styles show that:

1) The carry save adder implemented with the MUX approach gives the least worst-case delay and occupies the fewest slices when compared with CSAs implemented using only NAND, only NOR, or the AND, OR and XOR style. Thus, the multiplexer based CSA is a better approach to reduce the area than the regular style. When implemented in the proposed multiplier, this CSA significantly reduced the overall area occupied by the proposed design.

2) The multiplexer based approach has reduced the space complexity of the Urdhvatiryakbhyam technique, as is evident from the memory utilized (in Kb) in table 6.4. This makes it clear that logic optimized primitive adders are the best choice so far.

3) The MUX based approach has a cost of 35, which is the lowest of all the implementations.

TABLE 6.6
AREA COMPARISON OF CSA FOR DIFFERENT OPERAND SIZE USING DIFFERENT FULL-ADDER METHODOLOGIES

| Method | 4-bit CSA: slice utilization | 4-bit CSA: no. of LUTs | 8-bit CSA: slice utilization | 8-bit CSA: no. of LUTs | 16-bit CSA: slice utilization | 16-bit CSA: no. of LUTs | 32-bit CSA: slice utilization | 32-bit CSA: no. of LUTs |
|---|---|---|---|---|---|---|---|---|
| Method 1 | 7 | 13 | 20 | 38 | 45 | 87 | 104 | 202 |
| Method 2 | 11 | 22 | 23 | 45 | 58 | 114 | 141 | 277 |
| Method 3 | 8 | 16 | 22 | 43 | 62 | 122 | 135 | 264 |
| Method 4 | 7 | 13 | 19 | 38 | 45 | 87 | 104 | 202 |
| Method 5 | 7 | 13 | 20 | 38 | 45 | 88 | 104 | 203 |
| Method 6 | 7 | 13 | 20 | 38 | 45 | 87 | 104 | 202 |
| Method 7 | 7 | 14 | 19 | 38 | 45 | 87 | 104 | 202 |
| Method 8 | 7 | 14 | 19 | 38 | 45 | 87 | 104 | 202 |
| Method 9 | 7 | 14 | 20 | 39 | 45 | 87 | 103 | 201 |
| Method 10 | 8 | 15 | 19 | 37 | 47 | 92 | 103 | 201 |

6.2.3 Comparison of Proposed Vedic Multiplier with and Without Pipelining

TABLE 6.7
COMPARISON OF PROPOSED 32X32 BIT MULTIPLIER WITH AND WITHOUT PIPELINING

| Design | Delay | Area (slice utilization) |
|---|---|---|
| Proposed Vedic multiplier without pipelining | 34.658 ns | 1882/3584 |
| Proposed Vedic multiplier with pipelining | 10.520 ns (max. frequency: 95.057 MHz) | 1957/3584 |

From table 6.7 it can be inferred that the insertion of pipeline registers is an effective technique to increase the throughput of the system. The shrink in the gate delay is evident from the table, and the increase in the frequency of operation confirms it. Although the pipeline registers increase the area of the multiplier, they reduce the delay by 70%, which gives faster operation and higher throughput than conventional Vedic multipliers.

6.2.4 Comparison of Various Adders in Terms of Area and Delay

Table 6.8 shows the comparison of various 64-bit adders in terms of area and delay. The synthesis results show that the Han-Carlson adder is the fastest adder: it increases the speed by more than 70% in comparison to the conventional CLA and by up to 45% in comparison to the hybrid carry select adders which are currently used for wider bit lengths. This increase in speed comes with an increase in area. The proposed MAC therefore uses the Han-Carlson adder in the accumulator for high throughput and increased speed. Figure 6.7 shows the delay comparison of these adders.

TABLE 6.8
AREA AND DELAY COMPARISON OF VARIOUS 64-BIT ADDERS

| Adder | Total delay (ns) | Logic delay (ns) | Route delay (ns) | Logic levels | Slice utilization | 4-input LUT utilization |
|---|---|---|---|---|---|---|
| Ripple carry adder | 94.280 | 35.801 | 58.479 | 65 | 73 | 127 |
| Carry look-ahead adder | 87.398 | 35.801 | 51.597 | 65 | 73 | 127 |
| Hybrid carry select adder | 44.553 | 19.036 | 25.517 | 30 | 110 | 193 |
| Han-Carlson adder | 26.053 | 12.809 | 13.244 | 17 | 170 | 302 |

6.2.5 Area and Delay Results of Proposed MAC Using Han Carlson Adder

TABLE 6.9
AREA AND DELAY RESULTS OF THE PROPOSED MAC

| Device | Minimum period (ns) | Maximum frequency (MHz) | Minimum input arrival before clock (ns) | Max. output time after clock (ns) | Logic levels | Slice utilization | 4-input LUT utilization |
|---|---|---|---|---|---|---|---|
| Spartan-3 XC3S400-PQ208 | 10.557 | 94.771 | 4.795 | 6.530 | 7 | 1953/3584 | 3520/7168 |
| Spartan-3E XC3S500E-FG320 | 9.181 | 108.917 | 4.469 | 4.519 | 7 | 1961/4656 | 3514/9312 |

Chapter-7

CONCLUSION AND FUTURE SCOPE

This thesis introduces a novel technique for the Urdhvatiryakbhyam Vedic multiply and accumulate unit, which is a key element of digital signal processing. The speed of parallel architectures such as ALUs and DSP chips depends on the datapath, which in turn depends on individual elements such as multipliers.

Synthesis and simulation results of the proposed multiplier have been presented in this thesis. Along with the proposed multiplier, the existing Urdhva multiplier architectures using a ripple carry adder and a carry look-ahead adder, as well as the modified Vedic multiplier, have been implemented, and the differences in their overall performance in terms of delay, area and power have been observed. These results justify the motivation for introducing the new multiplier.

The design and implementation presented in this thesis take into account several factors such as slice utilization, delay, power, and time and space complexity. One of the major enhancements used to reduce the delay and the logic depth is the use of a single carry save adder (CSA) to compress all the partial products and obtain the final product. BEC-1 has been incorporated in the design for parallel generation of the final product bits. This combination of CSA and BEC-1 reduces the delay by up to 42% compared with the latest existing architecture, which uses two carry save adders.
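The way one three-operand addition and an increment of the upper bits produce the final product can be checked with a small behavioural model. The Python sketch below is an illustration only and is not the thesis VHDL: the name `mul32_from_16` is hypothetical, the four 16x16 products use Python's integer multiply in place of the recursive Vedic blocks, and the carry out of the three-operand addition is added as an integer (it can be 0, 1 or 2), whereas the listing in Appendix A applies a BEC-1 increment selected by a single carry bit.

```python
MASK16 = (1 << 16) - 1
MASK32 = (1 << 32) - 1


def mul32_from_16(a: int, b: int) -> int:
    """32x32 product assembled from four 16x16 partial products."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16

    p_ll = a_lo * b_lo          # weight 2**0
    p_hh = a_hi * b_hi          # weight 2**32
    p_hl = a_hi * b_lo          # weight 2**16
    p_lh = a_lo * b_hi          # weight 2**16

    # Concatenate high*high and low*low, as m(63 downto 0) does in the listing.
    base = (p_hh << 32) | p_ll

    # One three-operand addition on bits 47..16 (the role of the CSA).
    mid = ((base >> 16) & MASK32) + p_hl + p_lh
    carry = mid >> 32           # 0, 1 or 2 for three 32-bit operands

    # Upper 16 bits plus the carry (the role of the BEC-1 stage).
    upper = ((base >> 48) + carry) & MASK16

    return (upper << 48) | ((mid & MASK32) << 16) | (p_ll & MASK16)


print(hex(mul32_from_16(0xFFFFFFFF, 0xFFFFFFFF)))
```

The lowest 16 bits pass straight through from the low*low product, so only bits 63 to 16 go through the adder and the increment stage.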

Using an optimized expression for the primitive adder and implementing the carry save adder with a MUX-based approach has decreased the area by 15% compared with a regular-style implementation.

The thesis applies pipelining to increase the throughput, since pipelining increases the frequency of operation. As a result, the gate delay shrinks and the minimum period reduces to 10.52 ns. Further, a MAC has been designed using the proposed pipelined Vedic multiplier and the Han Carlson adder. The use of the Han Carlson adder has in turn improved the performance of the MAC to a large extent and has helped to achieve a maximum frequency of 108.917 MHz.
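As a reminder of what the MAC computes, the Python sketch below models the accumulate step on a stream of operand pairs. It is a behavioural illustration under stated assumptions: the function name `mac_accumulate` is hypothetical, pipeline latency is ignored, and the accumulator is assumed to wrap at 64 bits because the carry output of the 64-bit adder is not used in the top-level listing.

```python
MASK64 = (1 << 64) - 1


def mac_accumulate(pairs, acc=0):
    """Return the accumulator value after each multiply-accumulate step."""
    history = []
    for a, b in pairs:
        acc = (acc + a * b) & MASK64   # 64-bit register, carry out discarded
        history.append(acc)
    return history


print(mac_accumulate([(2, 3), (4, 5), (6, 7)]))
```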

In future, speed can be further optimized by using parallel prefix adders (PPA) that are more area efficient. To reduce the transistor count of this unit, the gate diffusion technique can be used at circuit level. Since a number of partial products are generated by the 2x2 modules in this unit, compressors can be used at the initial level to reduce the number of partial products.


REFERENCES

1. Purushottam D. Chidgupkar and Mangesh T. Karad, “The Implementation of Vedic Algorithms in Digital Signal Processing”, Global J. of Engng. Educ., Vol. 8, No. 2, 2004, UICEE, published in Australia.
2. Himanshu Thapliyal and Hamid R. Arabnia, “A Time-Area-Power Efficient Multiplier and Square Architecture Based on Ancient Indian Vedic Mathematics”, Department of Computer Science, The University of Georgia, 415 Graduate Studies Research Center, Athens, Georgia 30602-7404, U.S.A.
3. E. Abu-Shama, M. B. Maaz, M. A. Bayoumi, “A Fast and Low Power Multiplier Architecture”, The Center for Advanced Computer Studies, The University of Southwestern Louisiana, Lafayette, LA 70504.
4. M. E. Paramasivam, Dr. R. S. Sabeenian, “An Efficient Bit Reduction Binary Multiplication Algorithm using Vedic Methods”, 2010 IEEE 2nd International Advance Computing Conference.
5. Mr. Virendra Babanrao Magar, “Area and Speed Wise Superior Multiply and Accumulate Unit Based on Vedic Multiplier”, Journal of Engineering Research and Application, Vol. 3, Issue 6, Nov-Dec 2013, pp. 994-999.
6. D. Jaina, Sethi, “Vedic Mathematics Based Multiply and Accumulate Unit”, 2011 IEEE Conference on Computational Intelligence and Communication System.
7. Jagadguru Swami Sri Bharati Krisna Tirthaji Maharaja, “Vedic Mathematics: Sixteen Simple M. Pradhan and R. Panda, “Design and Implementation of Vedic Multiplier”, A.M.S.E. Journal, Computer Science and Statistics, France, vol. 15, July 2010, pp. 1-19.
8. Harpreet Singh Dhillon, Abhijit Mitra, “A Reduced-Bit Multiplication Algorithm for Digital Arithmetic”, International Journal of Computational and Mathematical Sciences, Spring 2008, pp. 64-69.

9. S. S. Kerur, Prakash Narchi, Jayashree C. N., Harish M. Kittur and Girish V. A., “Implementation of Vedic Multiplier for Digital Signal Processing”, International Journal of Computer Applications, 2011, vol. 16, pp. 1-5.

11. P. Verma, “Design of 4X4 bit Vedic Multiplier using EDA Tool”, International Journal of Computer Application (IJCA), Vol. 8, June 2012.
12. Abhyarthana Bisoy, Mitu Barat, “Comparison of a 32-Bit Vedic Multiplier with a Conventional Binary Multiplier”, 2014 IEEE International Conference on Advanced Communication Control and Computing Technologies (ICACCCT).
13. S. Vijayakumar, Dr. J. Sundararajan, “Low power multiplier using VEDIC carry lookahead Adder”, Research gate conference, March 2015.
14. Mi Lu, “Arithmetic and Logic in Computer Systems”, Wiley Publications.
15. Aneesh R, Sarin K Mohan, “Design and Analysis of High Speed, Area Optimized 32x32-Bit Multiply Accumulate Unit Based on Vedic Mathematics”, International Journal of Engineering Research & Technology (IJERT), Vol. 3, Issue 4, April 2014.
16. Gijin V George, Anoop Thomas, “High Performance Vedic Multiplier Using Han-Carlson Adder”, International Journal of Engineering Research & Technology (IJERT), Vol. 3, Issue 3, March 2014.
17. T. Han, David A. Carlson, “Fast area efficient VLSI adders”.
18. R. K. Bathija, R. S. Meena, “Low Power High Speed 16x16 bit Multiplier using Vedic Mathematics”, International Journal of Computer Applications (0975 – 8887), Volume 59, No. 6, December 2012.
19. Mohammed Hasmat Ali, Anil Kumar Sahani, “Study, Implementation and Comparison of Different Multipliers based on Array, KCM and Vedic Mathematics Using EDA Tools”, International Journal of Scientific and Research Publications, Volume 3, Issue 6, June 2013.
20. Harish Kumar, Hemanth Kumar A R, “Design and Implementation of Vedic Multiplier using Compressors”, International Journal of Engineering Research & Technology (IJERT), Vol. 4, Issue 06, June 2015.
21. Yogita Bansal, Charu Madhu, Pardeep Kaur, “High Speed Vedic Multiplier Designs - A Review”, Proceedings of 2014 RAECS, UIET Panjab University, Chandigarh, 06 – 08 March 2014.
22. Harish Babu N, Satish Reddy N, “Pipelined Architecture for Vedic Multiplier”, Advances in Electrical Engineering, 2014 IEEE conference.


23. R. Uma and P. Dhavachelvan, “Logic Optimization Using Technology Independent Mux Based Adders in FPGA”, International Journal of VLSI design & Communication Systems (VLSICS), Vol. 3, No. 4, August 2012.
24. Abdulkarim Al-Sheraidah, Yingtao Jiang, “A Novel Low Power Multiplexer-Based Full Adder”, ECCTD'01 - European Conference on Circuit Theory and Design, August 28-31, 2001, Espoo, Finland.
25. Taewhan Kim, W. Jao, S. Tjiang, “Circuit Optimization Using Carry-Save-Adder Cells”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 17, No. 10, 1998, pp. 974-984.
26. Prakash Pawar, Varun R, “Implementation of High Speed Pipelined Vedic Multiplier”, International Journal of Engineering Research & Technology (IJERT), Vol. 2, Issue 5, May 2013.
27. Harpreet Singh Dhillon and Abhijit Mitra, “A Reduced-Bit Multiplication Algorithm for Digital Arithmetic”, World Academy of Science, Engineering and Technology, Vol. 19, 2008-07-25.
28. Abhijeet Kumar, Dilip Kumar, Siddhi, “Hardware Implementation of 16*16 bit Multiplier and Square using Vedic Mathematics”, Design Engineer, CDAC, Mohali.
29. Ch. Harish Kumar, “Implementation and Analysis of Power, Area and Delay of Array, Urdhva, Nikhilam Vedic Multipliers”, International Journal of Scientific and Research Publications, Volume 3, Issue 1, January 2013.
30. Kala Priya, K. S. N. Raju, “Carry select adder using BEC and RCA”, International Journal of Advanced Research in Computer and Communication Engineering, Vol. 3, Issue 10, October 2014.
31. Jan M. Rabaey, “Digital Integrated Circuits”, Second Edition.
32. R. Uma, Vidya Vijayan, M. Mohanapriya, Sharon Paul, “Area, Delay and Power Comparison of Adder Topologies”, International Journal of VLSI design & Communication Systems (VLSICS), Vol. 3, No. 1, February 2012.


APPENDIX A

VHDL CODING

32-BIT PROPOSED VEDIC MAC

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity vedmac_pipe_32 is
    Port ( clk, resetn : in    STD_LOGIC;
           ain1        : in    STD_LOGIC_VECTOR (31 downto 0);
           bin1        : in    STD_LOGIC_VECTOR (31 downto 0);
           sout        : inout STD_LOGIC_VECTOR (63 downto 0));
end vedmac_pipe_32;

architecture Behavioral of vedmac_pipe_32 is

    -- Carry save adder for three n-bit operands (s2 port added to match the entity)
    component caresave_n is
        generic ( n : integer := 32);
        Port ( a2, b2, c2 : in  STD_LOGIC_VECTOR (n-1 downto 0);
               s2         : out STD_LOGIC_VECTOR (n-1 downto 0);
               cot        : out STD_LOGIC);
    end component;

    component accumulator is
        Port ( clk, resetn : in    STD_LOGIC;
               x           : in    STD_LOGIC_VECTOR (63 downto 0);
               result      : inout STD_LOGIC_VECTOR (63 downto 0));
    end component;

    component BEC_16 is
        generic ( p : integer := 16);
        Port ( B      : in  STD_LOGIC_VECTOR (p-1 downto 0);
               X      : out STD_LOGIC_VECTOR (p-1 downto 0);
               ccarry : in  STD_LOGIC);
    end component;

    -- Port names match the vedmine16_clk entity
    component vedmine16_clk is
        Port ( ainput, binput : in  STD_LOGIC_VECTOR (15 downto 0);
               soutp          : out STD_LOGIC_VECTOR (31 downto 0);
               clk            : in  STD_LOGIC);
    end component;

    component hancarlson_64 is
        Port ( ainp, binp : in  STD_LOGIC_VECTOR (63 downto 0);
               soutp      : out STD_LOGIC_VECTOR (63 downto 0);
               carry      : out STD_LOGIC);
    end component;

    component pipo is
        generic ( k : integer := 64);
        Port ( clk, resetn : in    STD_LOGIC;
               din         : in    STD_LOGIC_VECTOR (k-1 downto 0);
               qout        : inout STD_LOGIC_VECTOR (k-1 downto 0));
    end component;

    -- dff is instantiated below, so it is declared here as well
    component dff is
        Port ( d   : in  STD_LOGIC;
               clk : in  STD_LOGIC;
               q   : out STD_LOGIC);
    end component;

    signal p          : std_logic;
    signal m          : std_logic_vector (128 downto 0);
    signal sout1, sum : std_logic_vector (63 downto 0);
    signal n          : std_logic_vector (15 downto 0);
    signal l          : std_logic_vector (63 downto 16);
    signal j          : std_logic_vector (63 downto 16);

begin

    -- Four 16x16 partial products: low*low, high*high, high*low, low*high
    v1 : vedmine16_clk port map (ain1(15 downto 0),  bin1(15 downto 0),  m(31 downto 0),   clk);
    v2 : vedmine16_clk port map (ain1(31 downto 16), bin1(31 downto 16), m(63 downto 32),  clk);
    v3 : vedmine16_clk port map (ain1(31 downto 16), bin1(15 downto 0),  m(95 downto 64),  clk);
    v4 : vedmine16_clk port map (ain1(15 downto 0),  bin1(31 downto 16), m(127 downto 96), clk);

    -- Add both cross products to bits 47..16 of the concatenated result
    c1 : caresave_n port map (m(47 downto 16), m(127 downto 96), m(95 downto 64), l(47 downto 16), m(128));

    -- Increment the upper 16 bits when the carry save adder produces a carry
    bec3 : BEC_16 port map (m(63 downto 48), l(63 downto 48), m(128));

    -- Pipeline register on the 64-bit product
    f : for i in 0 to 15 generate
        d : dff port map (m(i), clk, sout1(i));
    end generate;

    f3 : for i in 16 to 63 generate
        d4 : dff port map (l(i), clk, sout1(i));
    end generate;

    -- Accumulate: sum = registered product + previous accumulator value
    add  : hancarlson_64 port map (sout1(63 downto 0), sout(63 downto 0), sum(63 downto 0), p);
    resg : pipo port map (clk, resetn, sum(63 downto 0), sout(63 downto 0));

end Behavioral;

16X16 Vedic Multiplier

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity vedmine16_clk is
    Port ( ainput : in  STD_LOGIC_VECTOR (15 downto 0);
           binput : in  STD_LOGIC_VECTOR (15 downto 0);
           soutp  : out STD_LOGIC_VECTOR (31 downto 0);
           clk    : in  STD_LOGIC);
end vedmine16_clk;

architecture Behavioral of vedmine16_clk is

    component vedmine8_clk is
        Port ( ain  : in  STD_LOGIC_VECTOR (7 downto 0);
               bin  : in  STD_LOGIC_VECTOR (7 downto 0);
               sout : out STD_LOGIC_VECTOR (15 downto 0);
               clk  : in  STD_LOGIC);
    end component;

    component caresave_16 is
        generic ( n : integer := 16);
        Port ( a2  : in  STD_LOGIC_VECTOR (n-1 downto 0);
               b2  : in  STD_LOGIC_VECTOR (n-1 downto 0);
               c2  : in  STD_LOGIC_VECTOR (n-1 downto 0);
               s2  : out STD_LOGIC_VECTOR (n-1 downto 0);
               cot : out STD_LOGIC);
    end component;

    component dff is
        Port ( d   : in  STD_LOGIC;
               clk : in  STD_LOGIC;
               q   : out STD_LOGIC);
    end component;

    component BEC is
        generic ( p : integer := 8);
        Port ( B      : in  STD_LOGIC_VECTOR (p-1 downto 0);
               X      : out STD_LOGIC_VECTOR (p-1 downto 0);
               ccarry : in  STD_LOGIC);
    end component;

    signal m : std_logic_vector (64 downto 0);
    signal n : std_logic_vector (7 downto 0);
    signal l : std_logic_vector (31 downto 8);

begin

    -- Four 8x8 partial products: low*low, low*high, high*low, high*high
    v1 : vedmine8_clk port map (ainput(7 downto 0),  binput(7 downto 0),  m(15 downto 0),  clk);
    v2 : vedmine8_clk port map (ainput(7 downto 0),  binput(15 downto 8), m(47 downto 32), clk);
    v3 : vedmine8_clk port map (ainput(15 downto 8), binput(7 downto 0),  m(63 downto 48), clk);
    v4 : vedmine8_clk port map (ainput(15 downto 8), binput(15 downto 8), m(31 downto 16), clk);

    -- Add both cross products to bits 23..8 of the concatenated result
    c1 : caresave_16 port map (m(23 downto 8), m(47 downto 32), m(63 downto 48), l(23 downto 8), m(64));

    -- BEC-1 increments the upper 8 bits; the multiplexer selects it on a carry
    bec1 : BEC port map (m(31 downto 24), n(7 downto 0), m(64));
    l(31 downto 24) <= n(7 downto 0) when m(64) = '1' else m(31 downto 24);

    -- Output pipeline register
    f : for i in 0 to 7 generate
        d1 : dff port map (m(i), clk, soutp(i));
    end generate;

    f2 : for i in 8 to 31 generate
        d2 : dff port map (l(i), clk, soutp(i));
    end generate;

end Behavioral;

8X8 Vedic Multiplier

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity vedmine8_clk is
    Port ( ain  : in  STD_LOGIC_VECTOR (7 downto 0);
           bin  : in  STD_LOGIC_VECTOR (7 downto 0);
           sout : out STD_LOGIC_VECTOR (15 downto 0);
           clk  : in  STD_LOGIC);
end vedmine8_clk;

architecture Behavioral of vedmine8_clk is

    component caresave_n8 is
        generic ( n : integer := 8);
        Port ( a2, b2, c2 : in  STD_LOGIC_VECTOR (n-1 downto 0);
               s2         : out STD_LOGIC_VECTOR (n-1 downto 0);
               cot        : out STD_LOGIC);
    end component;

    component vedmine4_clk is
        Port ( a, b : in  STD_LOGIC_VECTOR (3 downto 0);
               p    : out STD_LOGIC_VECTOR (7 downto 0);
               clk  : in  STD_LOGIC);
    end component;

    component dff is
        Port ( d   : in  STD_LOGIC;
               clk : in  STD_LOGIC;
               q   : out STD_LOGIC);
    end component;

    signal m : std_logic_vector (36 downto 0);
    signal l : std_logic_vector (15 downto 4);

begin

    -- Four 4x4 partial products: low*low, high*high, high*low, low*high
    v1 : vedmine4_clk port map (ain(3 downto 0), bin(3 downto 0), m(7 downto 0),   clk);
    v2 : vedmine4_clk port map (ain(7 downto 4), bin(7 downto 4), m(15 downto 8),  clk);
    v3 : vedmine4_clk port map (ain(7 downto 4), bin(3 downto 0), m(23 downto 16), clk);
    v4 : vedmine4_clk port map (ain(3 downto 0), bin(7 downto 4), m(31 downto 24), clk);

    -- Add both cross products to bits 11..4 of the concatenated result
    care1 : caresave_n8 port map (m(11 downto 4), m(23 downto 16), m(31 downto 24), l(11 downto 4), m(32));

    -- 4-bit BEC-1 on bits 15..12, selected by the carry m(32)
    m(33) <= m(12) and m(13);
    m(34) <= m(33) and m(14);
    l(12) <= not (m(12))        when m(32) = '1' else m(12);
    l(13) <= (m(13) xor m(12))  when m(32) = '1' else m(13);
    l(14) <= (m(14) xor m(33))  when m(32) = '1' else m(14);
    l(15) <= (m(15) xor m(34))  when m(32) = '1' else m(15);

    -- Output pipeline register
    f : for i in 0 to 3 generate
        h : dff port map (m(i), clk, sout(i));
    end generate;

    f1 : for i in 4 to 15 generate
        h1 : dff port map (l(i), clk, sout(i));
    end generate;

end Behavioral;

4X4 Vedic Multiplier

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity vedmine4_clk is
    Port ( a   : in  STD_LOGIC_VECTOR (3 downto 0);
           b   : in  STD_LOGIC_VECTOR (3 downto 0);
           p   : out STD_LOGIC_VECTOR (7 downto 0);
           clk : in  STD_LOGIC);
end vedmine4_clk;

architecture Behavioral of vedmine4_clk is

    signal n   : std_logic_vector (17 downto 0);
    signal l   : std_logic_vector (7 downto 2);
    signal cot : std_logic;

    component ved is
        Port ( a1  : in  STD_LOGIC_VECTOR (1 downto 0);
               b1  : in  STD_LOGIC_VECTOR (1 downto 0);
               s1  : out STD_LOGIC_VECTOR (3 downto 0);
               clk : in  STD_LOGIC);
    end component;

    component ripple_8 is
        generic ( n : integer := 8);
        Port ( a2, b2 : in  STD_LOGIC_VECTOR (n-1 downto 0);
               s2     : out STD_LOGIC_VECTOR (n-1 downto 0);
               carry1 : out STD_LOGIC);
    end component;

    component caresave_4 is
        Port ( a2, b2, c2 : in  STD_LOGIC_VECTOR (3 downto 0);
               s2         : out STD_LOGIC_VECTOR (3 downto 0);
               cot        : out STD_LOGIC);
    end component;

    component half is
        Port ( x  : in  STD_LOGIC;
               y  : in  STD_LOGIC;
               z  : out STD_LOGIC;
               co : out STD_LOGIC);
    end component;

    component dff is
        Port ( d   : in  STD_LOGIC;
               clk : in  STD_LOGIC;
               q   : out STD_LOGIC);
    end component;

begin

    -- Four 2x2 partial products: low*low, high*low, low*high, high*high
    v1 : ved port map (a(1 downto 0), b(1 downto 0), n(3 downto 0),   clk);
    v2 : ved port map (a(3 downto 2), b(1 downto 0), n(15 downto 12), clk);
    v3 : ved port map (b(3 downto 2), a(1 downto 0), n(11 downto 8),  clk);
    v4 : ved port map (a(3 downto 2), b(3 downto 2), n(7 downto 4),   clk);

    -- Add both cross products to bits 5..2 of the concatenated result
    care1 : caresave_4 port map (n(5 downto 2), n(11 downto 8), n(15 downto 12), l(5 downto 2), n(16));

    -- 2-bit BEC-1 on bits 7..6, driven by the carry n(16)
    h1 : half port map (n(6), n(16), l(6), n(17));
    l(7) <= (n(17) xor n(7)) when n(16) = '1' else n(7);

    -- Output pipeline register
    f : for i in 0 to 1 generate
        d : dff port map (n(i), clk, p(i));
    end generate;

    f1 : for i in 2 to 7 generate
        d1 : dff port map (l(i), clk, p(i));
    end generate;

end Behavioral;

2X2 Vedic Multiplier

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity ved is
    Port ( a1  : in  STD_LOGIC_VECTOR (1 downto 0);
           b1  : in  STD_LOGIC_VECTOR (1 downto 0);
           s1  : out STD_LOGIC_VECTOR (3 downto 0);
           clk : in  STD_LOGIC);
end ved;

architecture Behavioral of ved is

    component dff is
        Port ( d   : in  STD_LOGIC;
               clk : in  STD_LOGIC;
               q   : out STD_LOGIC);
    end component;

begin

    -- Registered 2x2 multiplier: the product bits are captured on the rising clock edge
    process (clk)
        variable m : std_logic_vector (8 downto 1);
    begin
        if rising_edge(clk) then
            m(5) := a1(0) and b1(0);   -- product bit 0
            m(1) := a1(1) and b1(0);
            m(2) := a1(0) and b1(1);
            m(3) := a1(1) and b1(1);
            m(6) := m(1) xor m(2);     -- product bit 1
            m(4) := m(1) and m(2);     -- carry into bit 2
            m(7) := m(3) xor m(4);     -- product bit 2
            m(8) := m(3) and m(4);     -- product bit 3
            s1   <= m(8 downto 5);
        end if;
    end process;

end Behavioral;
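The gate equations of the 2x2 block can be checked exhaustively, since there are only 16 input combinations. The Python sketch below mirrors the AND/XOR network of the listing without the output register; the function name `vedic_2x2` is an assumption of this sketch.

```python
def vedic_2x2(a: int, b: int) -> int:
    """2x2 product built from the same gates as the `ved` entity."""
    a0, a1 = a & 1, (a >> 1) & 1
    b0, b1 = b & 1, (b >> 1) & 1

    s0 = a0 & b0            # vertical product of the low bits
    t1 = a1 & b0            # crosswise terms
    t2 = a0 & b1
    t3 = a1 & b1            # vertical product of the high bits

    s1 = t1 ^ t2
    carry = t1 & t2
    s2 = t3 ^ carry
    s3 = t3 & carry
    return s0 | (s1 << 1) | (s2 << 2) | (s3 << 3)


for a in range(4):
    print([vedic_2x2(a, b) for b in range(4)])
```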

Binary to Excess-1 Code Converter

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity BEC_16 is
    generic ( p : integer := 16);
    Port ( B      : in  STD_LOGIC_VECTOR (p-1 downto 0);
           X      : out STD_LOGIC_VECTOR (p-1 downto 0);
           ccarry : in  STD_LOGIC);
end BEC_16;

architecture Behavioral of BEC_16 is

    -- m(i) is '1' when bits i downto 0 of B are all '1'
    signal m : std_logic_vector ((p-1) downto 1);

begin

    -- X = B + 1 when ccarry = '1', otherwise X = B
    X(0) <= (not B(0))       when ccarry = '1' else B(0);
    X(1) <= (B(1) xor B(0))  when ccarry = '1' else B(1);
    m(1) <= B(0) and B(1);

    g : for i in 2 to p-1 generate
        m(i) <= m(i-1) and B(i);
        X(i) <= (m(i-1) xor B(i)) when ccarry = '1' else B(i);
    end generate;

end Behavioral;
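The BEC-1 logic amounts to an incrementer built from an AND chain and XOR gates. The Python sketch below follows the same equations bit by bit so that the result can be compared with `B + 1`; the function name `bec1` and the default width are assumptions of this sketch, and the width is taken to be at least 2.

```python
def bec1(b: int, width: int = 16, ccarry: int = 1) -> int:
    """Binary to excess-1 conversion: returns b + 1 (mod 2**width) when ccarry is 1."""
    if not ccarry:
        return b                      # multiplexer passes B through unchanged
    bits = [(b >> i) & 1 for i in range(width)]
    x = [bits[0] ^ 1, bits[1] ^ bits[0]]
    m = bits[0] & bits[1]             # all lower bits are 1
    for i in range(2, width):
        x.append(m ^ bits[i])         # flip bit i when every lower bit is 1
        m &= bits[i]
    return sum(v << i for i, v in enumerate(x))


print(hex(bec1(0x00FF)), hex(bec1(0xFFFF)), hex(bec1(0x1234, ccarry=0)))
```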

Multioperand Carrysave Adder

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity caresave_n is
    generic ( n : integer := 32);
    Port ( a2, b2, c2 : in  STD_LOGIC_VECTOR (n-1 downto 0);
           s2         : out STD_LOGIC_VECTOR (n-1 downto 0);
           cot        : out STD_LOGIC);
end caresave_n;

architecture Behavioral of caresave_n is

    component half is
        Port ( x, y  : in  STD_LOGIC;
               z, co : out STD_LOGIC);
    end component;

    component full is
        Port ( x1, y1, cin : in  STD_LOGIC;
               z1, co1     : out STD_LOGIC);
    end component;

    signal s, c : std_logic_vector (n-1 downto 0);
    signal l    : std_logic_vector (n-2 downto 0);

begin

    -- First row: one full adder per bit produces a sum vector s and a carry vector c
    g : for i in 0 to n-1 generate
        f : full port map (a2(i), b2(i), c2(i), s(i), c(i));
    end generate;

    -- Second row: merge s with c shifted left by one position
    s2(0) <= s(0);
    h1 : half port map (c(0), s(1), s2(1), l(0));

    fu : for i in 2 to n-1 generate
        f : full port map (s(i), c(i-1), l(i-2), s2(i), l(i-1));
    end generate;

    cot <= c(n-1) or l(n-2);

end Behavioral;

64-BIT HANCARLSON ADDER

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity hancarlson_64 is
    Port ( ainp, binp : in  STD_LOGIC_VECTOR (63 downto 0);
           soutp      : out STD_LOGIC_VECTOR (63 downto 0);
           carry      : out STD_LOGIC);
end hancarlson_64;

architecture Behavioral of hancarlson_64 is
begin

    process (ainp, binp)
        variable p, p0, p1, p2, p3, g, g0, g1, g2, g3, cint, g4, p4, p5, g5 : std_logic_vector (63 downto 0);
    begin

        -- Bitwise propagate and generate
        q1 : for i in 0 to 63 loop
            p0(i) := ainp(i) xor binp(i);
            g0(i) := ainp(i) and binp(i);
        end loop;

        -- Stage 1: each odd bit is combined with its even neighbour
        q2 : for i in 0 to 31 loop
            p(2*i + 1) := (p0(2*i + 1) and p0(2*i));
            g(2*i + 1) := (g0(2*i + 1) or (p0(2*i + 1) and g0(2*i)));
        end loop;

        -- Stage 2: odd bits, distance 2
        q3 : for j in 1 to 31 loop
            p1(2*j + 1) := ((p(2*j + 1) and p(2*j - 1)));
            g1(2*j + 1) := (g(2*j + 1) or (p(2*j + 1) and g(2*j - 1)));
        end loop;

        -- Stage 3: odd bits, distance 4
        p2(5) := p1(5) and p(1);
        g2(5) := (g1(5) or (p1(5) and g(1)));
        q4 : for i in 3 to 31 loop
            p2(2*i + 1) := (p1(2*i + 1) and p1(2*i - 3));
            g2(2*i + 1) := g1(2*i + 1) or (p1(2*i + 1) and g1(2*i - 3));
        end loop;

        -- Stage 4: odd bits, distance 8
        p3(9)  := p2(9)  and p(1);   g3(9)  := g2(9)  or (p2(9)  and g(1));
        p3(11) := p2(11) and p1(3);  g3(11) := g2(11) or (p2(11) and g1(3));
        q5 : for i in 6 to 31 loop
            p3(2*i + 1) := (p2(2*i + 1) and p2(2*i - 7));
            g3(2*i + 1) := g2(2*i + 1) or (p2(2*i + 1) and g2(2*i - 7));
        end loop;

        -- Stage 5: odd bits, distance 16
        p4(17) := p3(17) and p(1);   g4(17) := g3(17) or (p3(17) and g(1));
        p4(19) := p3(19) and p1(3);  g4(19) := g3(19) or (p3(19) and g1(3));
        p4(21) := p3(21) and p2(5);  g4(21) := g3(21) or (p3(21) and g2(5));
        p4(23) := p3(23) and p2(7);  g4(23) := g3(23) or (p3(23) and g2(7));
        q6 : for i in 12 to 31 loop
            p4(2*i + 1) := (p3(2*i + 1) and p3(2*i - 15));
            g4(2*i + 1) := g3(2*i + 1) or (p3(2*i + 1) and g3(2*i - 15));
        end loop;

        -- Stage 6: odd bits, distance 32
        p5(33) := p4(33) and p(1);    g5(33) := g4(33) or (p4(33) and g(1));
        p5(35) := p4(35) and p1(3);   g5(35) := g4(35) or (p4(35) and g1(3));
        p5(37) := p4(37) and p2(5);   g5(37) := g4(37) or (p4(37) and g2(5));
        p5(39) := p4(39) and p2(7);   g5(39) := g4(39) or (p4(39) and g2(7));
        p5(41) := p4(41) and p3(9);   g5(41) := g4(41) or (p4(41) and g3(9));
        p5(43) := p4(43) and p3(11);  g5(43) := g4(43) or (p4(43) and g3(11));
        p5(45) := p4(45) and p3(13);  g5(45) := g4(45) or (p4(45) and g3(13));
        p5(47) := p4(47) and p3(15);  g5(47) := g4(47) or (p4(47) and g3(15));
        q7 : for i in 24 to 31 loop
            p5(2*i + 1) := (p4(2*i + 1) and p4(2*i - 31));
            g5(2*i + 1) := g4(2*i + 1) or (p4(2*i + 1) and g4(2*i - 31));
        end loop;

        -- Odd bits whose group generate was already complete in an earlier stage
        g5(31) := g4(31); g5(29) := g4(29); g5(27) := g4(27); g5(25) := g4(25);
        g5(23) := g4(23); g5(21) := g4(21); g5(19) := g4(19); g5(17) := g4(17);
        g5(15) := g3(15); g5(13) := g3(13); g5(11) := g3(11); g5(9)  := g3(9);
        g5(7)  := g2(7);  g5(5)  := g2(5);  g5(3)  := g1(3);  g5(1)  := g(1);

        -- Carries: odd positions come from the prefix tree, even positions need one more level
        cint(0) := g0(0);
        q8 : for i in 0 to 30 loop
            cint(2*i + 1) := g5(2*i + 1);
        end loop;
        q9 : for i in 1 to 31 loop
            cint(2*i) := (g0(2*i) or (p0(2*i) and g5(2*i - 1)));
        end loop;

        -- Sum bits
        soutp(0) <= p0(0) xor '0';
        q10 : for i in 1 to 63 loop
            soutp(i) <= p0(i) xor cint(i-1);
        end loop;

        carry <= g5(63);

    end process;

end Behavioral;
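The prefix structure of the Han Carlson adder can be written compactly as a loop over doubling distances, which makes it easy to compare against ordinary addition. The Python sketch below is a behavioural illustration, not a translation of the listing line by line: the function name `han_carlson_add` is an assumption, and the stages are generated by a loop instead of being unrolled. It follows the same scheme: one stage that pairs each odd bit with its even neighbour, a Kogge-Stone style tree on the odd bits only, and a final level for the even bits.

```python
def han_carlson_add(a: int, b: int, width: int = 64):
    """Return (sum mod 2**width, carry_out) using a Han Carlson prefix network."""
    p0 = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    g0 = [((a >> i) & (b >> i)) & 1 for i in range(width)]

    # Stage 1: odd bit k absorbs its even neighbour k-1.
    G, P = g0[:], p0[:]
    for k in range(1, width, 2):
        G[k] = g0[k] | (p0[k] & g0[k - 1])
        P[k] = p0[k] & p0[k - 1]

    # Prefix tree on odd bits only, with distances 2, 4, 8, ...
    d = 2
    while d < width:
        new_g, new_p = G[:], P[:]
        for k in range(1, width, 2):
            if k - d >= 1:
                new_g[k] = G[k] | (P[k] & G[k - d])
                new_p[k] = P[k] & P[k - d]
        G, P = new_g, new_p
        d *= 2

    # Carry out of each bit: odd bits from the tree, even bits need one more level.
    carry = [0] * width
    carry[0] = g0[0]
    for k in range(1, width, 2):
        carry[k] = G[k]
    for k in range(2, width, 2):
        carry[k] = g0[k] | (p0[k] & G[k - 1])

    total = p0[0]
    for i in range(1, width):
        total |= (p0[i] ^ carry[i - 1]) << i
    return total, carry[width - 1]


print(han_carlson_add(2**64 - 1, 1))
```

Restricting the tree to odd bits roughly halves the number of prefix cells compared with a full Kogge-Stone network, at the cost of the extra final level for the even bits.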


APPENDIX B

LIST OF PUBLICATIONS

1) Prachi Devpura, Anurag Paliwal, “High Throughput Vedic Multiplier using Binary to Excess-1 Code Converter”, International Journal of Advance Research in Electronics and Communication Engineering, Volume 4, Issue 6, June 2015, pp. 1771-1774.

2) Prachi Devpura, Anurag Paliwal, “High Throughput BEC-1 Based 32*32 bit Vedic Multiplier”, International Journal of Research in Electronics and Computer Engineering (IJRECE), Vol. 3, Issue 3, Sept. 2015, pp. 24-27.