10.1.1.1.4951

7/30/2019 10.1.1.1.4951

1/112

16-bit Booth Multiplier with 32-bit Accumulate

Marc Mosko

CMPE223 Independent Study


CMPE223 Booth Multiplier Marc Mosko

Table of Contents

Introduction
Basic Design
Performance Estimates
Booth Multiplier
VHDL Source Code
Code Overview
I/O Register Design
Example Register Access
Source Code
Source Code Hierarchy
VHDL Code Versions
Overflow Logic
Magic Layout
Design Hierarchy
RSIM Calibration
Optimization
References
VHDL Source Code
Addcell.vhd
Adder.vhd
Booth.vhd
claN.vhd
driverN.vhd
latch.vhd
mult.vhd
mult_cla.vhd
mult_pipe.vhd



Invchain
Mcell
Mcell
Ppmux
Ppmuxfa
Rwire
Wiring cells (passive)



Introduction

This report presents three main topics we investigated as part of a project to build a Booth-encoded multiply/accumulate VLSI chip. The original scope of work included synthesizing VHDL code using the Mentor Graphics tools. Exemplar was the VHDL compiler and Leonardo Spectrum was the synthesizer. Because my team, which included Kevin Delaney, did not meet a MOSIS deadline, our chip funding was lost. Since we did not actually fabricate a chip, we cannot discuss the success of our results. Likewise, VHDL synthesis using the Exemplar tools was not very successful, so we do not discuss synthesis results except in passing. The main points we cover are the basic architecture, our VHDL code, and a Magic layout in place of logic synthesis.

The work presented here, except as cited, is almost entirely my own. Teamwork with Kevin Delaney had some influence on the VHDL code, since he was primarily working on the synthesis portion of the project.

Due to length considerations, we have not included all VHDL code or any test suites. We have



Basic Design

The goal of the multiplier is to compute X[15:0] * Y[15:0] + W[31:0] = Z[31:0] and OVRFLW. OVRFLW is the multiply-accumulate overflow. We discuss OVRFLW in more detail below. It is not simply the carry-out of the final addition.
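The arithmetic goal can be sketched as a small Python reference model. This is only an illustration: it assumes X, Y, and W are two's-complement values and that OVRFLW flags any exact result that does not fit in 32 signed bits. The names `to_signed` and `mac` are hypothetical and do not come from the VHDL or C++ sources.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(x, y, w):
    """Return (Z, OVRFLW) for Z = X*Y + W with 16-bit X and Y and 32-bit W."""
    exact = to_signed(x, 16) * to_signed(y, 16) + to_signed(w, 32)
    z = exact & 0xFFFFFFFF                            # the 32 bits that reach Z
    ovrflw = not (-(1 << 31) <= exact < (1 << 31))    # assumed overflow rule
    return z, ovrflw


# Largest positive operands: the exact sum needs more than 32 signed bits.
print(mac(0x7FFF, 0x7FFF, 0x7FFFFFFF))
```

Under these assumptions the flag depends on the signed range of the exact result, which is why it cannot be read directly from the carry-out of the final adder.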

Our multiplier is based on the Booth-encoded array multiplier design in [3,4]. The 32-bit adder we use for the final addition is from [1,2,4]. We used a Carry-Select Adder (CSA) since it has a fairly regular layout and good performance.

The VHDL design is a 3-stage pipeline with I/O registers and a common 16-bit I/O bus. A complete transaction takes 7 complete cycles: load X, load Y, load W_H, load W_L, Multiply, read Z_H, read Z_L. Our design can overlap the multiply with loading a value, such as the next operation's X, so in a stream we are down to 6 cycles. The 6 or 7 cycle length is a limitation of



improperly sized transistors that did not pass 1 or sometimes 0 with enough force to drive the whole CPL NMOS chain. [3] also uses cross-coupled minimum-sized PMOS latches to restore the swing to output inverters. RSIM did not correctly simulate the swing restore, so we had to remove the cross-coupled latches.

We have verified correct operation of both the VHDL and Magic circuits with several boundary cases and 10,000 random multiply/accumulates. The VHDL test cases ran through the I/O registers while the Magic cases were raw arithmetic computations. The Magic layout had many problems, particularly with the carry-select adder design, which uses pass-logic. As of the writing of this report, we verified 10,000 random cases on the Magic layout with one error. We have fixed that error, but do not have time to rerun the whole batch. It takes about 8 hours to run all cases (we used four machines at 2 hours each).
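A test set of this shape (boundary cases followed by random cases) could be generated along the following lines. This is a minimal Python sketch, not the actual multgen or mult_framegen programs: the particular boundary values, the fixed seed, and the function names are assumptions made for illustration.

```python
import random


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def expected_z(x, y, w):
    """Low 32 bits of X*Y + W for two's-complement operands."""
    return (to_signed(x, 16) * to_signed(y, 16) + to_signed(w, 32)) & 0xFFFFFFFF


def make_cases(random_count, seed=0):
    """Return (x, y, w, z) tuples: every boundary combination, then random ones."""
    rng = random.Random(seed)       # fixed seed so a failing case can be replayed
    edge16 = [0x0000, 0x0001, 0x7FFF, 0x8000, 0xFFFF]
    edge32 = [0x00000000, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF]
    cases = [(x, y, w) for x in edge16 for y in edge16 for w in edge32]
    cases += [(rng.getrandbits(16), rng.getrandbits(16), rng.getrandbits(32))
              for _ in range(random_count)]
    return [(x, y, w, expected_z(x, y, w)) for x, y, w in cases]


cases = make_cases(10_000)
```

Because the generator is seeded, a single failing case can be rerun on its own rather than repeating the whole batch.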

Performance Estimates

Based on timing estimates from Leonardo Spectrum, we believe the VHDL system will run at about 48 MHz on a 2-phase clock (about 7 ns per phase, 3 phases). We do not believe these



down to 0.1 ns or less. There are a handful of 0.2 ns transitions in the critical path. The critical path has 61 transitions.

Because of uncertainty in both the Leonardo timing and the RSIM calibration, it is possible that both results are substantially off. The section RSIM Calibration describes our approach to calibrating RSIM for the AMI C5N 0.5 µm process.

The original fast adder in the Magic layout was a 2-2-4-4-4-8-8 CSA adder. This design, based on [2], assumes that all inputs arrive at about the same time. That is not the case here. Generally, the last bit to arrive at the adder is around Z[16], so one might wish to experiment with a 2-2-4-4-2-2-4-4-4 or other variant. Please see our comments on the CSA adder in the Optimization section below.
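To make the block notation concrete, the following Python sketch models a carry-select adder whose block widths are given from the least significant end, so `(2, 2, 4, 4, 4, 8, 8)` stands for the 2-2-4-4-4-8-8 arrangement. It is a functional model only: it says nothing about arrival times or delay, and the function name and argument layout are assumptions made for illustration.

```python
def carry_select_add(a, b, blocks=(2, 2, 4, 4, 4, 8, 8), carry_in=0):
    """Add a and b block by block; block widths are listed from the LSB upward."""
    result, carry, shift = 0, carry_in, 0
    for width in blocks:
        mask = (1 << width) - 1
        a_blk = (a >> shift) & mask
        b_blk = (b >> shift) & mask
        # Both candidate sums are formed without waiting for the carry-in.
        sum0 = a_blk + b_blk
        sum1 = a_blk + b_blk + 1
        chosen = sum1 if carry else sum0    # the incoming carry only steers a mux
        result |= (chosen & mask) << shift
        carry = chosen >> width
        shift += width
    return result, carry


print(carry_select_add(0xFFFFFFFF, 0x00000001))
```

Any partition whose widths total 32 gives the same sum; only the delay profile differs, which is what the choice between the variants is about.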

We had time to try a 2-2-4-4-4-4-4-8 adder, and our maximum time dropped from 7 ns to 6.1 ns for 4EF9 x E1DC + 287CF2D0 (we have since reduced the time even further). Our intuition



present schematics of each component in a later section with the Magic cell layouts. For now, we wish to present only a high-level floorplan.

Figure 2 shows the floorplan of a 6-bit Booth multiplier with 12-bit accumulate. It is essentially the same as the 16-bit multiplier. We will use the 6-bit version in our present discussion, since the floorplan fits on a single page. The vertical dashed lines are continuations of the X inputs. We used dashed lines to make it easier to see the regular wiring pattern.

The 12-bit accumulate requires a set of full adders as shown. The first five bits of the accumulate, W[4:0], use the first Booth row for addition. W5 cannot be added with X5: X5 is a sign bit but W5 is not. Therefore, we must add W5 with an adder on the outside of the array. A standard array multiplier has a fast adder outside the array. Along the bottom of the array, a sum bit is $Z_i = C_j + S_{j+1}$, where S and C are the sum and carry outputs of the bottom Booth row ppfa components. In our case of a 6-bit multiplier, j = i - 5. To add in a third bit, W_i, we use a full adder to compute the sum S and carry C from $C_j + S_{j+1} + W_i$. We may then use a fast adder to



single full adders have no ripple carry, this style seems to work well.

The design in [3,4] uses a sign extension mechanism between Booth-encoded rows. There is no constant offset, as in the current Kestrel multiplier. The sign extension uses the partial product output (pp_out) from the left-most column of the previous row. The pp_out output is the output of the partial product mux before the adder. Using this and the ff output of the previous row's sgn component, the sign extender computes new outputs to carry the sign to the next row. This technique uses one additional column in the array.



[Figure 2. Floorplan of the 6-bit Booth multiplier with 12-bit accumulate. Labels recovered from the figure: ppfa, fa, and addcell array cells; HA and FA adder cells; inputs x0 to x4, w0 to w11, and gnd; a 12-bit CSA adder drawn in two halves (1/2).]



VHDL Source Code

This section presents the most recent VHDL source code for a pipelined Booth-encoded multiplier. The code presented below may be found in the directory http://www.cse.ucsc.edu/~mmosko/cmpe223/report2/vhdl. There are several other versions under /projects/kestrel/users/mult/marc/vhdl/booth-1. The code presented here is mostly based on the leo directory (short for Leonardo, the synthesizer). The last part of this section makes brief comments on the other versions.

We used C++ to model the Booth multiply/accumulate before writing the VHDL code. The link is http://www.cse.ucsc.edu/~mmosko/cmpe223/report2/cpp. There are three versions. The first is an 8-bit adder using 4:2 compressors. The second is a 16-bit version, also with 4:2 compressors. The third is a 16-bit version with a fast adder. We shall not discuss this code any further in the interests of space.



The VHDL code mirrors the design in Fig. 4 as closely as possible. We made some abstractions. An abstract data type in VHDL replaces the one-hot Booth encoding. This allows the synthesizer to use whatever technique it chooses. The adders are abstract + signs, not actual fast adder implementations. The synthesizer may then use whatever style is appropriate.

Referring to Fig. 4, there are two main sections to the multiplier. The top section is the chip I/O consisting of six 16-bit registers, an overflow register, a 16-bit common I/O bus, and control signals. The bottom section is the pipelined Booth multiplier. The multiplier begins

i hi 2 d h l d h f ll i l hi 2 W di h

Signal       Direction  Purpose
busio_s2h    In/Out     16-bit input/output from multiplier.
ovrflw_s2h   Out        OVRFLW output.
bussel_s2h   In         3-bit multiplexed register selection.
bs_s2h       In         Input from bus (H) or multiplier (L). Applies to Z registers.
rw_s2h       In         Read from bus (H) or write to bus (L).
me_s2h       In         Multiplier Enable (perform the calculation).
rst_s2h      In         Reset all registers to 0.
clk          In         phi_1H, phi_1L, phi_2H, phi_2L



The second pipeline stage consists of four more Booth encoders reading from latched Y values. The computation begins on phi_1 and the results are latched at the end of the phase. There is another array multiplier, which continues the multiplication process. There is also an unsigned 8-bit adder to sum the results from the first pipeline section. Leonardo synthesized an Inverted Nibble adder. Since this addition is independent of the results in the second pipeline stage, we can perform this addition with little overhead.
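The split into two groups of four Booth rows can be illustrated with a short Python sketch. It assumes standard radix-4 (modified Booth) recoding, in which each digit takes one of the five values -2, -1, 0, +1, +2 and is derived from three overlapping Y bits. The sketch is purely arithmetic and does not model the carry-save array, the latches, or the 8-bit adder; the function names are hypothetical.

```python
def booth_digits(y, bits=16):
    """Radix-4 Booth recoding: bits/2 digits in {-2, -1, 0, 1, 2}, LSB first."""
    y &= (1 << bits) - 1
    padded = y << 1                          # append the implicit y[-1] = 0
    digits = []
    for i in range(0, bits, 2):
        b0 = (padded >> i) & 1               # y[i-1]
        b1 = (padded >> (i + 1)) & 1         # y[i]
        b2 = (padded >> (i + 2)) & 1         # y[i+1]
        digits.append(b0 + b1 - 2 * b2)
    return digits


def staged_multiply(x, y):
    """Signed x times the 16-bit pattern y, as two groups of four Booth rows."""
    digits = booth_digits(y)
    stage1 = sum(d * x * 4 ** i for i, d in enumerate(digits[:4]))
    stage2 = sum(d * x * 4 ** (i + 4) for i, d in enumerate(digits[4:]))
    return stage1 + stage2


print(booth_digits(0x7FFF))           # [-1, 0, 0, 0, 0, 0, 0, 2]
print(staged_multiply(3, 0xFFFF))     # 0xFFFF is -1 as a signed word: prints -3
```

The first four digits depend only on the low half of Y, so the low-order part of the product is settled once the first group of rows is done.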

According to Leonardo timing estimates, the 8-bit adder is not necessarily free. The 8-bit adder takes a comparable amount of time to the 4-row Booth multiplier. In fact, in some timing runs the 8-bit adder took longer than the multiply, indicating that it might not be a good idea to try the addition in this pipeline stage. Because of problems we had with the Leonardo timing estimates, we did not finish an analysis of this question. We would surmise that, since all pipeline stages have the same period, the second pipeline stage with 8-bit accumulate would still take less time than the 24-bit accumulate in the third stage.



I/O Register Design

The I/O registers follow the schematic in Fig. 3. The signal medly_s2h is used to clock in the value from multin_v2h. It only applies to the Z registers. The multiplier generates the delayed multiplier enable signal, medly, as part of the pipeline. The output signal store_s2h feeds both multout_s2h and busout_s2h. The tri-state drivers for the I/O bus are located in different VHDL code because of problems we had with the VHDL compiler. The register drives the bus when sel_s2h and not(rw_s2h) is true; otherwise it is tri-stated.

[Figure 3. Multiplier I/O Register. Labels recovered from the schematic: a DFF (D, Q, CLK, RST) and mux inputs labeled 0 and 1; signals bs_s2h, multin_v2h, busin_s2h, w2_s2h, store_s2h, csel_s2h, csel_q2h, phi_2h, rst_q2h, rst_s2h, medly_s2h, rden_s2h, sel_s2h, and rw_s2h.]

Line-----A-Bus output-------Bus input--------BBB-C-D-E-F
00000122 1011111001000011 1011111001000011 000 1 1 1 0



Column A is the expected carry out, which is set when reading from the multiplier; it corresponds to ovrflw_s2h. Bus output is the expected bus output. Bus input is the external driver to the bus. Column B is the bussel_s2h signal. Column C is the bs_s2h signal. Column D is the rw_s2h signal. Note that rw_s2h only affects the Z registers. Column E is the me_s2h signal. Column F is rst_s2h.

Prior to line 122, values were loaded into the X, Y, and W registers. On line 122, we enable me_s2h, which latches the X, Y, and W register values into the multiplier. Because of our I/O register design, we may simultaneously load a new value into a register while reading the register. Line 122 loads a new value 1011111001000011 (BE43) into the X register by selecting register 0 via bussel_s2h and asserting rw_s2h.

Line 123 loads a new value into the Y register and simultaneously stores the multiplier output into the Z registers. The new value 1010111101100101 (AF65) is stored in the Y register by selecting register 1 via bussel_s2h and asserting rw_s2h. The multiplier result is stored in



from a D-flipflop. In our test stimulus file, lines other than 124, 125, 12a, and 12b are "-" (don't care).

Lines 126 and 127 load the W values (2D04417F). Lines 128 and 129 are similar to lines 122 and 123. They load the next X and Y values and compute BE43 * AF65 + 2D04417F = 41B71EEE. We read the Z values in lines 12a and 12b.
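The register sequence can be replayed with a small behavioral model in Python. It is not cycle-accurate and ignores bs_s2h, rw_s2h, and the clock phases. It assumes the register numbering 0 to 5 for X, Y, W_H, W_L, Z_H, Z_L, takes the Y operand from the 16-bit bus pattern 1010111101100101 (AF65 in hex), and uses a signed-range overflow rule. The class and method names are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


class MacRegisters:
    """Behavioral, not cycle-accurate, model of the six 16-bit I/O registers."""

    X, Y, W_H, W_L, Z_H, Z_L = range(6)    # assumed register numbers

    def __init__(self):
        self.regs = [0] * 6
        self.ovrflw = False

    def load(self, reg, value):
        """Write a 16-bit bus value into the selected register."""
        self.regs[reg] = value & 0xFFFF

    def multiply(self):
        """Model the multiplier enable: compute X*Y + W into Z_H and Z_L."""
        w = (self.regs[self.W_H] << 16) | self.regs[self.W_L]
        exact = (to_signed(self.regs[self.X], 16) * to_signed(self.regs[self.Y], 16)
                 + to_signed(w, 32))
        self.ovrflw = not (-(1 << 31) <= exact < (1 << 31))   # assumed rule
        self.regs[self.Z_H] = (exact >> 16) & 0xFFFF
        self.regs[self.Z_L] = exact & 0xFFFF

    def read(self, reg):
        return self.regs[reg]


chip = MacRegisters()
chip.load(MacRegisters.X, 0xBE43)
chip.load(MacRegisters.Y, 0xAF65)      # bus pattern 1010111101100101
chip.load(MacRegisters.W_H, 0x2D04)
chip.load(MacRegisters.W_L, 0x417F)
chip.multiply()
result = (chip.read(MacRegisters.Z_H) << 16) | chip.read(MacRegisters.Z_L)
print(f"{result:08X}")                 # prints 41B71EEE
```

The seven calls correspond to the seven steps of a complete transaction: four loads, one multiply, and two reads.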



[Figure: block diagram of the multiplier; caption not recovered. Labels recovered from the figure: BUSIO_S2H[15:0] common bus; bussel_s2h[2:0] driving a 3:8 demux; inputs rw_s2h, rst_s2h, bs_s2h, me_s2h, and clk[3:0] (phi1_h/l, phi2_h/l); output ovrflw_s2h; reg 0 = x[15:0], reg 1 = y[15:0], reg 2 = w[31:16], reg 3 = w[15:0], reg 4 = Z[31:16], reg 5 = Z[15:0]; a dff for Ovrflw; two 16 x 4 Booth multiplier stages separated by pipeline registers; an 8-bit unsigned add; ovrflw logic.]



Source Code

The table below lists the 59 files that are part of the VHDL code. Generally, there are three or four files associated with each major component. For the component foo, there would be foo.vhd, which is the instantiation of the entity and architecture. foo_test.vhd is a test script that uses foo.vhd as a component (UUT). The test script reads stimulus from foo.txt. Sometimes there will be a foogen.{pl|cc} to generate the stimulus.

Fileno  File Name              Description
1       addcell.txt            Test file stimulus
2       addcell.vhd            Implements +1 when booth sign negative
3       addcell_test.vhd       Test script for addcell
4       adder.txt              Test file stimulus
5       adder.vhd              Single bit and N bit full adder
6       adder_test.vhd
7       adder15.txt            Test file stimulus
8       adder15gen.pl          Generates stimulus for exhaustive 15-bit adder, incorrect carry out
9       adderN_test.vhd        A 15-bit adder test using "adder.vhd"
10      adk.vhd                Cell library for Leonardo
11      booth.txt              Test file stimulus
12      booth.vhd              Booth type (abstracts one-hot), booth encoder, sign propagation
13      booth_test.vhd
14      claN.vhd               An n-bit carry-lookahead adder (abstract "plusN" and



29      modelsim.ini           Ini file for Exemplar VHDL
30      mult.txt               Test file stimulus
31      mult.vhd               The whole multiplier with I/O registers
32      mult_test.vhd          Test for "mult.vhd"
33      mult_test_1.vhd        Test for synthesized "mult.vhd" (uses std_ulogic)
34      mult_cla.vhd           The CLA adder used by the multiplier
35      mult_framegen          Generates test cases (boundary and random)
36      mult_framegen.cc       C++ source code for mult_framegen
37      mult_frame.tcl         A timing analysis file example
38      mult_frame.txt         Test stimulus
39      mult_frame.vhd         The booth array multiplier and CLA adder
40      mult_frame_test.vhd    Test for "mult_frame"
41      mult_frame_test_u.vhd  Test for "mult_frame" (synthesized)
42      mult_pipe.txt          Test stimulus
43      mult_pipe.vhd          The booth array multiplier
44      mult_pipe_test.vhd     Old test script -- out of date
45      multgen.cc             Generates test cases for "mult.vhd"
46      multreg.vhd            N-bit multiplier register using "dffr_fall" and an input buffer
47      multregN.txt           Test stimulus for 4-bit register, non-exhaustive, out of date (for latch, not dff)
48      multreg_test.vhd       Test script for 4-bit register
49      Mymake                 CSH script to create everything
50      plusN_test.vhd         Tests abstract N-bit adder
51      pp.vhd                 Partial-product cells (ppmux, ppfa, ppfapp)
52      ppfa.txt               Test cases for "ppfa"
53      ppfagen.pl             Generates test cases for "ppfa"
54      t t T f " "



computes bus_wr_h[7:0] as a one-hot control signal to a set of 16-bit tri-state buffers for each I/O register's busout_s2h signal. Finally, the code computes the ovrflw signal.

The component multregn instantiates an N-bit I/O register, as described above. It uses the components dffrN_fall (file 25) and buf (file 22). DffrN_fall is an N-bit D-flipflop with reset, clocked on the falling edge. Buf is a 1-bit buffer. We had to play some tricks with signal buffering to ensure proper fan-out. Leonardo had trouble with our source code and generating proper fan-out. We believe the problem was that we did not follow a strict hierarchical structure of combinatorial logic followed by registers.

The component mult_cla instantiates a 24-bit fast adder. It uses the component plusN (file 14). PlusN is an abstracted + operation in VHDL with some added logic to compute the carry. We used to have mult_cla and mult_pipe in the same source file as part of the same component. We separated them at some point because of timing simulation problems with Leonardo.



component sgn (file 51) implements the sign extender of Fig. 2. The components dffr_fall, dffrN_fall, gdffr_fall, and gdffrN_fall (all file 25) implement single-bit and N-bit D-flipflops with reset. The g versions are gated and have a tri-state Enable input (no longer used).

Inside mult_pipe, we used to drive each pipeline stage from a gated transparent latch. By using a gated latch, we could conserve power by eliminating spurious transitions while computing the previous pipeline stage. At the end of the first pipeline stage, for instance, we would latch the data at the end of phi_2 and enable the tri-state output at the beginning of phi_1. We used the components glatchrN, etc. When we switched to the DFF, there was no reason to continue using a gated version, since the flipflop is not transparent. Thus, the gdff and dff components are identical except for an extra Enable signal that does nothing. We preserved the Enable input so that our code semantics did not change.

VHDL Code Versions

There are six versions of the VHDL code. The code that best synthesizes is in a directory called leo under /projects/kestrel/users/mult/marc/vhdl/booth-1. The leo code was the basis for the



double-rail nature and had better synthesis results. Kevin Delaney found a cell library for Leonardo, the synthesis tool. The cell library is called ADK. We began using the ADK cell library in the source tree adk. adk is a non-pipelined multiplier. adk-pipe is a pipelined multiplier. adk-pipe-cla is a pipelined multiplier with a carry-look-ahead adder. We hard-coded the CLA structure with a behavioral description. In our final version, we steered away from being so specific and just used a + sign.

We learned several things from these many versions and our efforts at synthesis. In our opinion, one should try to be as abstract as possible and let the synthesizer figure out the specifics. One must be aware of automatic register generation and what sort of statements will not synthesize. Apart from those concerns, we would recommend staying away from gate-level specifics. When one tries to enforce a specific structure, there is usually competition with the synthesizer and no one wins. There are directives to give the synthesizer guidelines for specific modules, but we did not have much success with them.



Overflow Logic

A multiply-accumulate where all words are n-bit does not have overflow. Our architecture, however, does have the potential for overflow since the accumulate is twice the word size of the multiplier/multiplicand. We compute a signed overflow from the following two assertions for Z[m:0] = X[n:0] * Y[n:0] + W[m:0], where in our case n=15 and m=31. There is overflow if (1) x*y > 0, w > 0 and z



Magic Layout

The table below lists the 53 files that make up the Magic layout. In general, there are three types of files, similar to the VHDL directory structure. For the component foo, the file foo.mag is the Magic cell. foo.cmd is the RSIM command file that runs a test suite. Some components will have a foo.{pl|c|cc} program to generate the test cases. Sometimes, there is a foo_head.cmd file with the header portion of the CMD file independent of the test cases. There is also a csa subdirectory with a VHDL model of the CSA adder. To view these files with the recompiled Magic, set the environment variable CAD_HOME=/projects/kestrel/users/mult/tools and execute Magic as magic -TSCN3ME_SUBM.30 from $CAD_HOME/bin.

Fileno  File Name           Description
60      Addcell.cmd         RSIM command file w/ exhaustive stimulus
61      Addcell.mag         Generates the +1 for negative Booth encoding
62      broute.mag          A wiring channel
63      bth.cmd             RSIM command file w/ exhaustive stimulus
64      bth.mag             Booth encoding and sign propagation
65      bthbuf.mag          Inverter chain for booth lines
66      bthroute.mag        Wire routing for "bth" cell
67      bwire.mag           Wiring channel



Fileno  File Name           Description
85      csa_8.mag           CSA 8-bit chain
86      csa_cond.cmd        RSIM command file w/ exhaustive stimulus
87      csa_cond.mag        CSA conditional input section
88      csa_first.mag       CSA first cell in multi-bit chain
89      csa_last.mag        CSA last cell in multi-bit chain
90      csa_mid.cmd         RSIM command file w/ exhaustive stimulus
91      csa_mid.mag         CSA middle cell in multi-bit chain
92      csa_wire.mag        Used in CSA_32
93      fa.cmd              RSIM command file w/ exhaustive stimulus
94      fa.mag              Full adder CPL style
95      fa_cmos.mag         Full adder CMOS style
96      fa_tg.cmd           RSIM command file w/ exhaustive stimulus
97      fa_tg.mag           Full adder w/ 1 level deep TG style for ppfa cell
98      fa_tg2.mag          Full adder w/ 1 level deep TG style for W sum
99      invchain.mag        Single-rail to double-rail inverter chain
100     invtop.mag          Top row inverter chains for X and W
101     mcell.cmd           RSIM command file w/ exhaustive stimulus
102     mcell.mag           Multiplier cell (ppmuxfa and wiring)
103     mult_head.cmd       Header file for RSIM (no test cases)
104     mult_add.cmd        RSIM file with random tests
105     mult_add.mag        16x16 Booth multiplier with 32-bit accumulate
106     mult_add_head.cmd   RSIM file header
107     multgen.cc          C++ program to generate "mult_add" test cases
108     ppmux.cmd           RSIM command file w/ exhaustive stimulus
109     ppmux.mag           TG style partial product mux
110     ppmuxfa.mag         ppmux with full adder (fa_tg)
111     rwire.mag           Wiring channel and inverters to drive CSA



The top-level cell is mult_add.mag. This cell has some glue wiring and all the raw input/output. The X input is via the cell invtop[15:0]/X_H. The W input connects directly to the wires Wn_H, where n ranges from 15 to 31, and to the cells invtop[14:0]/X_H. The Y input connects directly to the wires Yn_H, where n ranges from 0 to 15. The output Z connects to the Sn_H outputs of various CSA cells. The OVRFLW output connects to ovrflw_0/ovrflw_h.

The X and W[14:0] inputs pass through the cell array invtop. These are inverter chains along the top of the multiplier to generate the proper drive for the long X wires. The W inverters are small, since those signals only drive the adder in the top row of the multiplier. The X signals must drive about 0.450 pF. The cell invtop connects directly to the multiplier array cells, mcell.

The Y input connects to the cell bth along the left side of the multiplier. The bth cell produces the 5-bit one-hot Booth encoding of the Y word [3]. The bth cell also computes the sign propagation [3]. There are three Y inputs per bth cell, with one input common between two cells. Each bth cell generates a double-rail Y signal with a small inverter chain. The output of



The main array cell is mcell. It contains three components: ppmux, fa_tg, and wroute. Ppmux is a pass-logic multiplexer to select the proper X input based on the Booth encoding for the row [3]. The cell fa_tg is a double-rail transmission-gate based full adder [3]. It calculates the sum and carry in parallel. There are four output inverters for the sum and carry-out. We added four input inverters for the B_H/B_L inputs, one pair of inverters for each of the carry and sum logic. We found there was too much back-pressure from the transmission gates and it caused uncertainty in RSIM about who was driving whom. Wroute is a wire channel routing cell to pass horizontal and vertical signals. The sum out connects two columns to the right while the carry-out connects one column to the right. The X signals pass directly down.

Along the right side of the mcell array is a column of addcell. Addcell checks the row's Booth encoding and generates a double-rail 0 or 1 output [3]. If the Booth encoding is negative, it generates the 1 output. The cell also passes the sum and carry outputs from mcell through to the next column. Addcell connects to a column of rwire, which is a vertical wiring channel to connect Addcell to the fast adder in the right-hand column. Rwire has a pair of inverters to drive



The basic CSA blocks are csa_cond, csa_first, csa_mid, and csa_last [4]. Csa_cond is a sub-component of the other three. It is a double-rail pass-transistor mux to compute the conditional sum and carry bits. One must always use csa_first and csa_last. For an adder of three or more bits, one inserts the necessary number of csa_mid cells. We created three adder sizes, csa_2, csa_4 (and csa_4b), and csa_8. Each of these cells has a 2-inverter driver chain for the double-rail carry-in input. This is necessary, since load varies widely between the three cells. The RSIM estimates are 0.081 pF, 0.133 pF, and 0.243 pF for the 2, 4, and 8-bit cells (see the Optimization section below). The cells csa_2 and csa_4 are designed for use along the right side of the multiplier. The cells csa_4b and csa_8 are designed for the bottom of the multiplier.
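The composition rule for a multi-bit chain can be written down in a few lines of Python. The sketch assumes one cell per bit, which is our reading of the rule rather than something the layout files confirm, and the function name is hypothetical.

```python
def csa_block_cells(bits):
    """List the cells of one CSA chain, least significant bit first."""
    if bits < 2:
        raise ValueError("a chain needs at least csa_first and csa_last")
    # One csa_first, then as many csa_mid cells as needed, then one csa_last.
    return ["csa_first"] + ["csa_mid"] * (bits - 2) + ["csa_last"]


for width in (2, 4, 8):
    print(width, csa_block_cells(width))
```

Under that assumption a 2-bit chain has no csa_mid cells at all, while an 8-bit chain has six.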

We had to make many substantial changes to the CSA designs in [1,2,4]. The original designs used extensive pass-logic. RSIM showed many unknown errors in our original layouts. We corrected some by inserting intermediate inverters. Other errors, which we originally thought were problems with RSIM and pass logic, ended up being insufficient 1 drive from fa_tg for



The bottom row of mcell connects downward to a row of bwire, a wire routing channel. Below the channel is a row of fa_tg2. These full adders sum the carry-out, sum-out, and W values for each output bit. The output of the full adders then passes through the wiring channel broute and drives the bottom 16 bits of the CSA adder. The last 16 bits of the CSA adder are made up of a 4-4-8 design using csa_4b and csa_8.

1. capm2a .00003 ; 2nd metal cap -- area, pf/sq-micron
2. capm2p .00020 ; 2nd metal cap -- perimeter, pf/micron
3. capma .00006 ; 1st metal cap -- area, pf/sq-micron
4. capmp .00020 ; 1st metal cap -- perimeter, pf/micron
5. cappa .00005 ; poly cap -- area, pf/sq-micron
6. cappp .00020 ; poly cap -- perimeter, pf/micron
7. capda .00030 ; n-diffusion cap -- area, pf/sq-micron
8. capdp .00040 ; n-diffusion cap -- perimeter, pf/micron
9. cappda .00050 ; p-diffusion cap -- area, pf/sq-micron
10. cappdp .00040 ; p-diffusion cap -- perimeter, pf/micron
11. capga .00215 ; gate cap -- area, pf/sq-micron
12. lambda 0.3 ; microns/lambda
13. lowthresh 0.4 ; logic low threshold as a normalized voltage
14. highthresh 0.6 ; logic high threshold as a normalized voltage
15. cntpullup 0
16. diffperim 0
17. subparea 0
18. diffext 0
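The area and perimeter entries suggest how a wire's capacitance is built up. The Python sketch below assumes the total is simply area times the area coefficient plus perimeter times the perimeter coefficient. That is our reading of the parameter comments, not a statement of how RSIM or the extractor actually counts perimeter. The table and function names are hypothetical, and the coefficients are copied from items 1 to 6.

```python
# (area cap in pF per sq-micron, perimeter cap in pF per micron), items 1 to 6
CAP = {
    "metal2": (0.00003, 0.00020),
    "metal1": (0.00006, 0.00020),
    "poly": (0.00005, 0.00020),
}


def wire_cap_pf(layer, width_um, length_um):
    """Estimate the capacitance in pF of a rectangular wire on one layer."""
    area_cap, perim_cap = CAP[layer]
    area = width_um * length_um
    perimeter = 2 * (width_um + length_um)
    return area * area_cap + perimeter * perim_cap


# A 1 micron wide, 100 micron long first-metal wire.
print(round(wire_cap_pf("metal1", 1.0, 100.0), 4))   # 0.0464
```

For a long, narrow wire the perimeter term dominates, so the estimate scales almost linearly with length.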



shown below. We also used SPICE parameters to calculate the gate capacitance. Items 13 to 18 above were left as-is from the original PRM file. Items 19 to 26 came from MOSIS.

We calculated the gate capacitance and drain capacitance following the SPICE calculations

presented in [5, pp. 188ff]. Gate capacitance has two components, the intrinsic and extrinsic,

which are summed for the total: $C_{gin} = W L C_{ox}$ and

$C_{gex} = W C_{gso} + W C_{gdo} + 2 L C_{gbo}$. The parameters for the gate-source, gate-drain, and

gate-body capacitances came from the Hspice parameters. They are, respectively, $1.93 \times 10^{-10}$

F/m, $1.93 \times 10^{-10}$ F/m, and $1.00 \times 10^{-9}$ F/m. The gate oxide thickness is $1.38 \times 10^{-6}$ m. Since RSIM

uses a unit measurement per area, we set W and L to 1 . The drain capacitance is given by the

following, where CJ, VJ, PB, MJ, CJSW, and MJSW are SPICE parameters. Their values are

4.22E-4, 2.5, 0.984, 3.49E-10, 1.20E-1. We used an area of 1 and a perimeter of 4.

$$C_j = Area \cdot CJ \cdot \left(1 + \frac{VJ}{PB}\right)^{-MJ} + Perim \cdot CJSW \cdot \left(1 + \frac{VJ}{PB}\right)^{-MJSW}$$
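The gate and junction capacitance expressions above can be evaluated with a few lines of Python. This is only a sketch under stated assumptions: the function and argument names are ours, the oxide capacitance is taken as the permittivity of SiO2 divided by the oxide thickness (the usual textbook relation, not stated in the text), and all quantities are in SI units, so the results still have to be converted to the pF per square micron that RSIM expects.

```python
EPS0 = 8.854e-12            # permittivity of free space, F/m
EPS_SIO2 = 3.9 * EPS0       # assumed relative permittivity of SiO2 is 3.9


def gate_cap(w, l, tox, cgso, cgdo, cgbo):
    """Total gate capacitance in farads (intrinsic plus extrinsic).

    w, l, tox are in metres; cgso, cgdo, cgbo are in F/m.
    """
    c_ox = EPS_SIO2 / tox                       # F/m^2 (assumption)
    c_gin = w * l * c_ox                        # intrinsic part
    c_gex = w * cgso + w * cgdo + 2 * l * cgbo  # extrinsic (overlap) part
    return c_gin + c_gex


def junction_cap(area, perim, cj, cjsw, mj, mjsw, vj, pb):
    """Diffusion (junction) capacitance from the SPICE-style parameters."""
    factor = 1.0 + vj / pb
    return area * cj * factor ** (-mj) + perim * cjsw * factor ** (-mjsw)
```

Calling gate_cap with W and L set to one unit of length gives the per-area figure described in the text; junction_cap with an area of 1 and a perimeter of 4 mirrors the drain calculation.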



RSIM. Long wires, such as the booth-encoded selectors, could range between 0.5 pF and 0.6 pF.

We generally fixed n based on layout considerations.

When generating double-rail signals from single-rail inputs, we usually used 2-inverter/3-inverter

trees or 3-inverter/4-inverter trees. Sometimes this was sub-optimal, since we used fewer but

larger inverters based on layout restrictions. The layout restrictions came from the standard cell

size we selected early in the design process.

The CSA adder is designed as a 2-2-4-4-4-8-8 chain, based on [2]. Using Magic estimates of

input capacitance for the carry-in, we designed an input driver for each of csa_2, csa_4, and

csa_8 to optimize the performance of each element. The component csa_last generates the

car_h, car_l carry outputs with a 6/6 inverter that then drives a 3/3 transmission gate for the

carry select. Thus, csa_last has low drive ability.
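A behavioural view of the 2-2-4-4-4-8-8 chain may help when comparing block arrangements. The sketch below is not the circuit described above: it is a plain Python model, with names of our own choosing, in which each block forms both candidate results (carry-in 0 and carry-in 1) and the incoming carry selects one of them.

```python
def carry_select_add(a, b, widths=(2, 2, 4, 4, 4, 8, 8), carry_in=0):
    """Add two unsigned integers block by block; returns (sum, carry_out)."""
    total, carry, shift = 0, carry_in, 0
    for w in widths:
        mask = (1 << w) - 1
        a_blk = (a >> shift) & mask
        b_blk = (b >> shift) & mask
        # Both candidates depend only on the block inputs, so in hardware
        # they can be computed before the carry from the lower blocks arrives.
        result_c0 = a_blk + b_blk
        result_c1 = a_blk + b_blk + 1
        chosen = result_c1 if carry else result_c0   # the carry select
        total |= (chosen & mask) << shift
        carry = chosen >> w
        shift += w
    return total, carry
```

Any tuple of widths that adds up to the word size gives the same arithmetic result, so the choice of chain only affects timing and drive, which is what the layout work above is tuning.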

From Magic, the input capacitances of csa_2, csa_4, and csa_8 are, respectively, 0.081 pF, 0.133 pF,



The 2-2-4-4-4-8-8 design assumes that all inputs arrive at the same time. In our multiplier case,

that is not true. The input to the first 8-bit adder actually arrives last. One might experiment

with different designs, such as 2-2-4-8-2-2-4-8.

We found that the ff output of the cell bth drove about 0.139 pF but only had a 12/16 output

inverter. We redesigned it as a 2-inverter chain of 4/6 and 12/18. Using a 4/6 rather than a 3/5

reduced the size of the second inverter by 2 . Going from a 28 of input capacitance down to

10 also helped. This one change improved performance by approximately 15% overall.



References

1. Abu-Khater, I.S.; Bellaouar, A.; Elmasry, M.I.; Yan, R.H., Circuit/architecture for low-power high-performance 32-bit adder, Fifth Great Lakes Symposium on VLSI, Buffalo, NY, USA, 16-18 March 1995, pp. 74-7.

2. Abu-Khater, I.S.; Yan, R.H.; Bellaouar, A.; Elmasry, M.I., A 1-V low-power high-performance 32-bit conditional sum adder, Symposium on Low Power Electronics. Digest of Technical Papers, San Diego, CA, USA, 10-12 Oct. 1994, pp. 66-7.

3. Abu-Khater, I.S.; Yan, R.H.; Bellaouar, A.; Elmasry, M.I., Circuit Techniques for CMOS Low-Power High-Performance Multipliers, IEEE Journal of Solid-State Circuits, v. 31, no. 10, Oct. 1996, pp. 1535-1546.

4. Bellaouar, A. and M.I. Elmasry, Low-Power Digital VLSI Design. Circuits and Systems, Kluwer Academic Publishers, Boston: 1995.

5. Weste, N.H.E. and K. Eshraghian, Principles of CMOS VLSI Design. A systems



VHDL Source Code

Addcell.vhd

------------------------------------------------------------------------
-- Add Cell from "Low-power Digital VLSI Design" by
-- Bellaouar and Elmasry.
-- Returns 1 if Booth encoding is negative else 0
------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
use work.bth_types.all;

entity addcell is
  port (bth : in std_ulogic_vector(4 downto 0);
        sum : out std_ulogic);
end addcell;


-- description of adder using concurrent signal assignments
architecture rtl of addcell is
begin
  sum



Adder.vhd

------------------------------------------------------------------------
-- Single-bit adder
------------------------------------------------------------------------

library IEEE, adk;
use IEEE.std_logic_1164.all;

entity adder is
  port ( a_h : in std_ulogic;
         b_h : in std_ulogic;
         c_h : in std_ulogic;
         sum_h : out std_ulogic;
         car_h : out std_ulogic);
end adder;

architecture rtl of adder is


  component fadd1 is
    port (
      A : in STD_LOGIC;
      B : in STD_LOGIC;
      CI : in STD_LOGIC;
      S : out STD_LOGIC;
      CO : out STD_LOGIC
    );
  end component;

  signal a : std_logic;
  signal b : std_logic;
  signal c : std_logic;
  signal s : std_logic;
  signal t : std_logic;

begin
  a



        sum_h : out std_ulogic_vector(N downto 1);
        car_h : out std_ulogic);
end adderN;

-- structural implementation of the N-bit adder
architecture ripple of adderN is
  component adder
    port (a_h : in std_ulogic;
          b_h : in std_ulogic;
          c_h : in std_ulogic;
          sum_h : out std_ulogic;
          car_h : out std_ulogic);
  end component;

  signal carry : std_ulogic_vector(0 to N);
begin
  carry(0) b_h(I),
           c_h => carry(I - 1),
           sum_h => sum_h(I),
           car_h => carry(I));
  end generate;
end ripple;



Booth.vhd

------------------------------------------------------------------------
-- Constants used by Booth functions
------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;

package bth_types is
  constant bth_m1 : integer := 4;
  constant bth_m2 : integer := 3;
  constant bth_p2 : integer := 2;
  constant bth_p1 : integer := 1;
  constant bth_z0 : integer := 0;
end bth_types;


------------------------------------------
-- Booth encoder for row j
------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
use work.bth_types.all;

entity booth_encode is
  port( in_h : in std_ulogic_vector (2 downto 0);
        bth_h : out std_ulogic_vector (4 downto 0));
end booth_encode;

architecture rtl of booth_encode is
begin
  -- input "in_h" is Y(2i+1) Y(2i) Y(2i-1) MSB order
  -- See bth.vhd for booth types
  bth_h



end rtl;
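For readers who want to check the encoder against arithmetic, the following Python sketch models a radix-4 Booth encoder with the five select lines declared in bth_types. It is an assumption-laden illustration, not a transcription of the VHDL: the digit table is the standard radix-4 Booth recoding, the one-hot line positions are taken from the bth_m1 through bth_z0 constants, and the helper names are ours.

```python
# One-hot line positions, taken from the bth_types package.
BTH_M1, BTH_M2, BTH_P2, BTH_P1, BTH_Z0 = 4, 3, 2, 1, 0
_DIGIT_TO_LINE = {-1: BTH_M1, -2: BTH_M2, 2: BTH_P2, 1: BTH_P1, 0: BTH_Z0}


def booth_digit(y_hi, y_mid, y_lo):
    """Standard radix-4 Booth digit for bits Y(2i+1), Y(2i), Y(2i-1)."""
    return -2 * y_hi + y_mid + y_lo


def booth_encode(y_hi, y_mid, y_lo):
    """Return the 5-bit one-hot select word for one row."""
    return 1 << _DIGIT_TO_LINE[booth_digit(y_hi, y_mid, y_lo)]


def booth_multiply(x, y, bits=16):
    """Multiply signed x by signed y using bits/2 Booth rows."""
    padded = (y & ((1 << bits) - 1)) << 1    # row 0 sees a 0 below the LSB
    product = 0
    for row in range(bits // 2):
        triplet = (padded >> (2 * row)) & 0b111
        digit = booth_digit(triplet >> 2, (triplet >> 1) & 1, triplet & 1)
        product += (digit * x) << (2 * row)  # partial product, weight 4**row
    return product
```

With bits=16 the loop runs eight times, which corresponds to the eight rows of the array.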



claN.vhd

------------------------------------------------------------------------
-- N-bit Carry-Lookahead adder
-- The width of the adder is determined by generic N
-- From Altera examples
------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
use work.adder;

entity claN is
  generic(N : positive);
  port (a_h : in std_ulogic_vector(N-1 downto 0);
        b_h : in std_ulogic_vector(N-1 downto 0);
        c_h : in std_ulogic;
        sum_h : out std_ulogic_vector(N-1 downto 0);
        car_h : out std_ulogic);
end claN;

architecture behavioral of claN is
  signal h_sum : std_ulogic_vector(N-1 downto 0);
  signal car_gen : std_ulogic_vector(N-1 downto 0);
  signal car_prop : std_ulogic_vector(N-1 downto 0);
  signal car_intern : std_ulogic_vector(N-1 downto 1);

begin
  h_sum



architecture behavioral of plusN is
  signal x : std_logic_vector(N-1 downto 0);
  signal y : std_logic_vector(N-1 downto 0);

  signal w : std_logic_vector(N-1 downto 0);
  signal z : std_logic_vector(N-1 downto 0);
  signal a : signed (N-1 downto 0);
  signal b : signed (N-1 downto 0);
  signal c : signed (N-1 downto 0);
  signal s : signed (N-1 downto 0);

  signal t4_h : std_ulogic;
  signal t5_h : std_ulogic;
begin
  x



  signal w : std_logic_vector(N downto 0);
  signal z : std_logic_vector(N downto 0);
  signal a : unsigned (N downto 0);
  signal b : unsigned (N downto 0);
  signal c : unsigned (N downto 0);
  signal s : unsigned (N downto 0);

begin
  x(N-1 downto 0)



driverN.vhd

------------------------------------------------------------------------
-- N-bit driver
------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;

entity buf is
  port ( signal Q : out std_ulogic;
         signal D : in std_ulogic);
end buf;

architecture behavior of buf is
begin
  Q



latch.vhd

------------------------------------------------------------------------
-- N-bit LATCH with reset
-- The width of the latch is determined by generic N
------------------------------------------------------------------------

library IEEE;
use IEEE.std_logic_1164.all;

entity dffr_fall is
  port ( Rst : in std_ulogic;
         Clk : in std_ulogic;
         signal D : in std_ulogic;
         signal Q : out std_ulogic);
end dffr_fall;

architecture behavior of dffr_fall is
begin
  process(Rst, Clk, D)
  begin
    if Rst = '1' then
      Q



         signal D : in std_ulogic;
         signal Q : out std_ulogic);
end dffr_rise;

architecture behavior of dffr_rise is
begin
  process(Rst, Clk, D)
  begin
    if Rst = '1' then
      Q



  end component;

begin
  gen: for j in 0 to N-1 generate
    dffgen: dffr_fall port map (Rst=> Rst, Clk=> Clk, D=> D(j), Q=> Q(j));
  end generate;
end behavior;

------------------------------------------------------

library IEEE;
use IEEE.std_logic_1164.all;

entity dffrN_rise is
  generic(N : positive);
  port ( Rst : in std_ulogic;
         Clk : in std_ulogic;
         signal D : in std_ulogic_vector(N-1 downto 0);
         signal Q : out std_ulogic_vector(N-1 downto 0));
end dffrN_rise;

architecture behavior of dffrN_rise is
  component dffr_rise is
    port ( Rst : in std_ulogic;
           Clk : in std_ulogic;
           signal D : in std_ulogic;
           signal Q : out std_ulogic);
  end component;

begin
  gen: for j in 0 to N-1 generate
    dffgen: dffr_rise port map (Rst=> Rst, Clk=> Clk, D=> D(j), Q=> Q(j));
  end generate;
end behavior;

library IEEE;
use IEEE.std_logic_1164.all;

entity latchr is
  port ( Rst : in std_ulogic;



         Clk : in std_ulogic;
         signal D : in std_ulogic_vector(N-1 downto 0);
         signal Q : out std_ulogic_vector(N-1 downto 0));
end latchrN;

architecture behavior of latchrN is
  component latchr is
    port ( Rst : in std_ulogic;
           Clk : in std_ulogic;
           signal D : in std_ulogic;
           signal Q : out std_ulogic);
  end component;

  signal my_clk : std_logic_vector(N/8 downto 0);
  signal my_rst : std_logic_vector(N/8 downto 0);

begin
  process (Clk)
  begin
    clk_buf: for i in 0 to N/8 LOOP
      my_clk(i) my_clk(j/8), D=> D(j),
                Q=> Q(j));
  end generate;
end behavior;

------------------------------------------------------------------------
-- N-bit dff with reset : NON-TRANSPARENT ON GATED BUFFER
-- The width of the dff is determined by generic N



begin
  dff: latchr port map ( Rst=> Rst, Clk=> Clk, D=> D, Q=> w );
  Q



mult.vhd

------------------------------------------------------------------------
-- N-bit multiplier
-- This is a phi-2 device.
--
-- BusIO_S2H is the pad i/o bus
-- Ovrflw_s2h is the overflow output. Should be made an InOut for carryin
-- BusSEL_S2H is a chip select, encoded active high
-- BS_S2H is the input select (bus high, mult low)
-- RW_S2H is the Read/Write select (read high, write low)
-- ME_S2H is the Multiplier Enable
-- Rst_S2H is a reset signal. It is clocked with PHI_2 to ensure
--         that it does not muck with stuff when it is not supposed to
--         Reset is immediate. There is no 1 cycle delay, like
--         with regular signals.
------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
--use work.converts.all;

entity mult is
  port ( BusIO_S2H : inout std_logic_vector(15 downto 0);
         Ovrflw_S2H : out std_ulogic;
         BusSEL_S2H : in std_ulogic_vector(2 downto 0);
         BS_S2H : in std_ulogic;
         RW_S2H : in std_ulogic;
         ME_S2H : in std_ulogic;
         Rst_S2H : in std_ulogic;
         PHI_1H : in std_ulogic;
         PHI_2H : in std_ulogic);
end mult;

architecture structural of mult is

  -- A multiplier register of width N
  component multregn is
    generic(N : positive );
    port ( BusOUT_S2H : out std_ulogic_vector(N-1 downto 0);



  component mult_pipe is
    port( z_v2h : out std_ulogic_vector(7 downto 0);
          a_v2h : out std_ulogic_vector(23 downto 0);
          b_v2h : out std_ulogic_vector(23 downto 0);
          c_v2h : out std_ulogic;
          ovrflw_v2h : out std_ulogic_vector(2 downto 0);
          medly_s2h : out std_ulogic;
          x_s2h : in std_ulogic_vector(15 downto 0);
          y_s2h : in std_ulogic_vector(15 downto 0);
          w_s2h : in std_ulogic_vector(31 downto 0);
          me_s2h : in std_ulogic;
          PHI_1H : in std_ulogic;
          PHI_2H : in std_ulogic;
          Rst_s2h : in std_ulogic
    );
  end component;

  -- single bit D flip flop
  component dffr_fall is
    port ( Rst : in std_ulogic;
           Clk : in std_ulogic;
           signal D : in std_ulogic;
           signal Q : out std_ulogic);
  end component;

  component buf is
    port ( Q : out std_ulogic;
           D : in std_ulogic);
  end component;

  -- Buses to/from the multiplier from the registers
  signal bus_x : std_ulogic_vector(15 downto 0);
  signal bus_y : std_ulogic_vector(15 downto 0);
  signal bus_w : std_ulogic_vector(31 downto 0);
  signal bus_z : std_ulogic_vector(31 downto 0);

  -- wiring from multiplier to CLA unit
  signal bus_a : std_ulogic_vector(23 downto 0);
  signal bus_b : std_ulogic_vector(23 downto 0);
  signal bus_c : std_ulogic;




  -- temporary signals used to compute overflow
  signal t1_h : std_ulogic;
  signal t2_h : std_ulogic;
  signal t3_h : std_ulogic;
  signal t4_h : std_ulogic;
  signal t5_h : std_ulogic;

  -- outputs from registers
  signal feed_r0 : std_ulogic_vector(15 downto 0);
  signal feed_r1 : std_ulogic_vector(15 downto 0);
  signal feed_r2 : std_ulogic_vector(15 downto 0);
  signal feed_r3 : std_ulogic_vector(15 downto 0);
  signal feed_r4 : std_ulogic_vector(15 downto 0);
  signal feed_r5 : std_ulogic_vector(15 downto 0);

  -- buffered clocks
  signal phi_a_1h : std_ulogic_vector(6 downto 0);
  signal phi_a_2h : std_ulogic_vector(6 downto 0);

begin
  ---------------------------------------------------------------
  -- Decode the input register select
  bus_sel_h



    port map ( BusOUT_S2H => feed_r0,
               BusIN_S2H => To_StdULogicVector(busio_s2h),
               MultOut_S2H => bus_x,
               MultIn_V2H => Gnd_16,
               Sel_s2h => bus_sel_h(0),
               BS_S2H => Vdd,
               RW_S2H => RW_S2H,
               MEDLY_S2H => MEDLY_Q2H,
               RST_S2H => RST_S2H,
               PHI_1H => PHI_1H,
               PHI_2H => PHI_2H);

  -- R1 is the Y register
  -- R1 never reads from the multiplier (BS = Vdd, MultIn = GND)
  reg_1: multregN
    generic map (16)
    port map ( BusOUT_S2H => feed_r1,
               BusIN_S2H => To_StdULogicVector(busio_s2h),
               MultOut_S2H => bus_y,
               MultIn_V2H => Gnd_16,
               Sel_s2h => bus_sel_h(1),
               BS_S2H => Vdd,
               RW_S2H => RW_S2H,
               MEDLY_S2H => MEDLY_Q2H,
               RST_S2H => RST_S2H,
               PHI_1H => PHI_1H,
               PHI_2H => PHI_2H);

  -- R2 is the W(31:16) register
  -- R2 never reads from the multiplier (BS = Vdd, MultIn = GND)
  reg_2: multregN
    generic map (16)
    port map ( BusOUT_S2H => feed_r2,
               BusIN_S2H => To_StdULogicVector(busio_s2h),
               MultOut_S2H => bus_w(31 downto 16),
               MultIn_V2H => Gnd_16,
               Sel_s2h => bus_sel_h(2),
               BS_S2H => Vdd,
               RW_S2H => RW_S2H,
               MEDLY_S2H => MEDLY_Q2H,



    generic map (16)
    port map ( BusOUT_S2H => feed_r4,
               BusIN_S2H => To_StdULogicVector(busio_s2h),
               MultIn_V2H => bus_z(31 downto 16),
               Sel_s2h => bus_sel_h(4),
               BS_S2H => BS_S2H,
               RW_S2H => RW_S2H,
               MEDLY_S2H => MEDLY_Q2H,
               RST_S2H => RST_S2H,
               PHI_1H => PHI_1H,
               PHI_2H => PHI_2H);

  -- R5 is the Z(15:0) register
  -- R4 & R5 have no MultOut connections
  reg_5: multregN
    generic map (16)
    port map ( BusOUT_S2H => feed_r5,
               BusIN_S2H => To_StdULogicVector(busio_s2h),
               MultIn_V2H => bus_z(15 downto 0),
               Sel_s2h => bus_sel_h(5),
               BS_S2H => BS_S2H,
               RW_S2H => RW_S2H,
               MEDLY_S2H => MEDLY_Q2H,
               RST_S2H => RST_S2H,
               PHI_1H => PHI_1H,
               PHI_2H => PHI_2H);

  ---------------------------------------------------------------
  -- Storage for the Overflow output
  ---------------------------------------------------------------
  Rst_q2h ovrflw_v2h, Clk=> MEDLY_Q2H, Rst=> Rst_q2h);

  -- allows us to monitor ovrflw_s2h without using a buffered I/O pin
  ovrflw_s2h




  cla_0 : mult_cla
    generic map (24)
    port map (
      z_v2h => bus_z(31 downto 8),
      car_v2h => car_out,
      a_v2h => bus_a,
      b_v2h => bus_b,
      c_v2h => bus_c
    );

  ---------------------------------------------------------------
  -- Compute the overflow
  -- An overflow is defined when
  -- 1) x*y > 0 and w > 0 and z < 0 or
  -- 2) x*y < 0 and w < 0 and z > 0
  --
  ----------------------------------------------------------------

  t1_h
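The overflow rule stated in the comments above can be checked with a small arithmetic model. The Python sketch below is our own illustration, not the VHDL logic: it assumes 16-bit signed x and y, a 32-bit signed w, and a 32-bit two's-complement result z, and the function names are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac_with_overflow(x, y, w):
    """Return (z, overflow) for z = x*y + w wrapped to 32 bits."""
    product = x * y
    z = to_signed(product + w, 32)
    # Rule from the comments: same-signed operands, opposite-signed result.
    overflow = (product > 0 and w > 0 and z < 0) or \
               (product < 0 and w < 0 and z > 0)
    return z, overflow
```

For operands in the assumed ranges, the flag is set exactly when the wrapped result differs from the exact value of x*y + w.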



mult_cla.vhd

------------------------------------------------------------------------
-- 24-bit CLA as separate entity for synthesis
--
------------------------------------------------------------------------

library IEEE;
use IEEE.std_logic_1164.all;

entity mult_cla is
  generic (N : positive );

  port( z_v2h : out std_ulogic_vector(N-1 downto 0);
        car_v2h : out std_ulogic;
        a_v2h : in std_ulogic_vector(N-1 downto 0);
        b_v2h : in std_ulogic_vector(N-1 downto 0);
        c_v2h : in std_ulogic
  );
end mult_cla;

architecture rtl of mult_cla is
  component plusN is
    generic( N : positive);
    port ( a_h : in std_ulogic_vector(N-1 downto 0);
           b_h : in std_ulogic_vector(N-1 downto 0);
           c_h : in std_ulogic;
           sum_h : out std_ulogic_vector(N-1 downto 0);
           car_h : out std_ulogic);
  end component;

  component claN is
    generic( N : positive);
    port ( a_h : in std_ulogic_vector(N-1 downto 0);
           b_h : in std_ulogic_vector(N-1 downto 0);
           c_h : in std_ulogic;
           sum_h : out std_ulogic_vector(N-1 downto 0);
           car_h : out std_ulogic);
  end component;




December 1, 2000 Page 54

mult_pipe.vhd

------------------------------------------------------------------------
-- Booth encoded carry-save-adder array
--
-- From "Low-power Digital VLSI Design" by Bellaouar and Elmasry.
-- and
-- "Circuit Techniques for CMOS Low-Power High-Performance Multipliers"
-- by Abu-Khater, Bellaouar, Elmasry in IEEE J. Solid-State Circuits v.31 (10)
-- Oct 1996 pp. 1535ff
--
-- z_v2h       Multiply accumulate output (x * y + w) (only low-order 8 bits)
-- a_v2h       goes to fast adder for high-order 24-bits
-- b_v2h
-- c_v2h
-- ovrflow_v2h 3-bits to compute overflow (w[31] x[31] y[31])
-- medly_s2h   Output good at end of phase (see me_s2h, this is delayed)
-- x_s2h       multiplicand
-- y_s2h       multiplier (gets booth encoded)
-- w_s2h       accumulate
-- me_s2h      multiplier enable
-- PHI_1H      clock
-- PHI_2H      clock
-- Rst_s2h     Reset internal registers to 0
--
-- The Y inputs are booth encoded then gated until ME_S2H & PHI_2H.
-- The Y inputs should be applied first to give the booth encoders time
-- to settle. The Y inputs must remain valid until MEDLY_S2H (actually
-- until a 1/2 cycle before...)
------------------------------------------------------------------------
------------------------------------------------------------------------
-- Variables are generally named as follows:
-- name_PtCl
--
-- P = pipe line stage (1, 2, or 3)
-- t = type (s,q,v)
-- C = clock phase (1 or 2)
-- l = logic (L or H)
--
-- examples:
-- sum_0_1v2h = row 0 sum 1st pipe stage, V timing, Phi-2, active high
--
-- Rules:
-- Variables can only be assigned if P and C the same:
-- x_1v2h




--
-- To go between phases/stages you need to use a storage device:
--
-- gdffr_fall(Q=> x_2v1h, D=> x_1v2h, Clk=> mdly_q2h, Enable=> mdly_q1h)
-- This clocks in x_1v2h on mdly_q2h and
-- enables the output to x_2v1h on mdly_q1h
--
------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
--use work.converts.all;

-- We use a fixed width / height for simplicity.
-- Overflow = x*y + w out of range

entity mult_pipe is
  port( z_v2h : out std_ulogic_vector(7 downto 0);
        a_v2h : out std_ulogic_vector(23 downto 0);
        b_v2h : out std_ulogic_vector(23 downto 0);
        c_v2h : out std_ulogic;
        ovrflw_v2h : out std_ulogic_vector(2 downto 0);
        medly_s2h : out std_ulogic;
        x_s2h : in std_ulogic_vector(15 downto 0);
        y_s2h : in std_ulogic_vector(15 downto 0);
        w_s2h : in std_ulogic_vector(31 downto 0);
        me_s2h : in std_ulogic;
        PHI_1H : in std_ulogic;
        PHI_2H : in std_ulogic;
        Rst_s2h : in std_ulogic
  );
end mult_pipe;

architecture rtl of mult_pipe is
  constant COL : integer := 16;
  constant ROW : integer := 8;

  -- AddCell will add a 0/1 to each row depending on the sign
  -- of the booth encoding.
  component addcell is
    port ( bth : in std_ulogic_vector(4 downto 0);
           sum : out std_ulogic);
  end component;

  -- A standard full adder
  component adder is
    port ( a_h : in std_ulogic;





           b_h : in std_ulogic;
           c_h : in std_ulogic;
           sum_h : out std_ulogic;
           car_h : out std_ulogic);
  end component;

  -- unsigned addition
  component uplusN is
    generic( N : positive);
    port ( a_h : in std_ulogic_vector(N-1 downto 0);
           b_h : in std_ulogic_vector(N-1 downto 0);
           c_h : in std_ulogic;
           sum_h : out std_ulogic_vector(N-1 downto 0);
           car_h : out std_ulogic);
  end component;

  -- A standard full adder 15 bits wide
  component adderN is
    generic( N : positive);
    port ( a_h : in std_ulogic_vector(N-1 downto 0);
           b_h : in std_ulogic_vector(N-1 downto 0);
           c_h : in std_ulogic;
           sum_h : out std_ulogic_vector(N-1 downto 0);
           car_h : out std_ulogic);
  end component;

  -- Generate a 5-line demultiplexed booth encoding of 3 input bits
  component booth_encode is
    port( in_h : in std_ulogic_vector (2 downto 0);
          bth_h : out std_ulogic_vector (4 downto 0));
  end component;

  -- Partial product generator with full adder
  -- Has only SUM (and carry) out
  component ppfa is
    port ( bth : in std_ulogic_vector(4 downto 0);
           x1_h : in std_ulogic;
           x2_h : in std_ulogic;
           s0_h : in std_ulogic;
           c0_h : in std_ulogic;
           sum_h : out std_ulogic;
           ca1_h : out std_ulogic);
  end component;

  -- Partial product generator with full adder
  -- Has both PP out and SUM (and carry) out





  component ppfapp is
    port ( bth : in std_ulogic_vector(4 downto 0);
           x1_h : in std_ulogic;
           x2_h : in std_ulogic;
           s0_h : in std_ulogic;
           c0_h : in std_ulogic;
           pp_h : out std_ulogic;
           sum_h : out std_ulogic;
           ca1_h : out std_ulogic);
  end component;

  -- Sign extender. Computes sign bits to pass to next row.
  -- Adds 2 bits per row. "ff" is the "flag" bit.
  component sgn is
    port ( pp_h : in std_ulogic;
           ff_h : in std_ulogic;
           pp_out_h: out std_ulogic;
           ff_out_h: out std_ulogic);
  end component;

  -- D flip flop with reset
  component dffr_fall is
    port ( Rst : in std_ulogic;
           Clk : in std_ulogic;
           signal D : in std_ulogic;
           signal Q : out std_ulogic);
  end component;

  component dffrN_fall is
    generic(N : positive );
    port ( Rst : in std_ulogic;
           Clk : in std_ulogic;
           signal D : in std_ulogic_vector(N-1 downto 0);
           signal Q : out std_ulogic_vector(N-1 downto 0));
  end component;

  -- a gated flipflop
  component gdffr_fall is
    port ( Rst : in std_ulogic;
           Clk : in std_ulogic;
           Enable : in std_ulogic;
           signal D : in std_ulogic;
           signal Q : out std_ulogic);
  end component;

  -- an N-bit gated flipflop





  component gdffrN_fall is
    generic(N : positive );
    port ( Rst : in std_ulogic;
           Clk : in std_ulogic;
           Enable : in std_ulogic;
           signal D : in std_ulogic_vector(N-1 downto 0);
           signal Q : out std_ulogic_vector(N-1 downto 0));
  end component;

  -- These are the outputs from the sign extenders
  -- one for each row
  -- pp15 is the pp output of the 15th column of each row
  -- we need 9 sets of wires since we have inputs to row 0 and outputs from row 7

  -- v2 signals in 1st pipe stage, v1 signals in 2nd
  -- (pp1 = 1st stage, pp2 = 2nd, pp3 = 3rd)

  -- There is some overlap here, since in PHI2 we generate pp1_v2h(4) which
  -- is then latched to PHI1
  signal pp_1v2h : std_ulogic_vector(4 downto 0);
  signal ff_1v2h : std_ulogic_vector(4 downto 0);
  signal pp15_1v2h: std_ulogic_vector(4 downto 0);

  signal pp_2v1h : std_ulogic_vector(8 downto 4);
  signal ff_2v1h : std_ulogic_vector(8 downto 4);
  signal pp15_2v1h: std_ulogic_vector(7 downto 4);

  signal pp_3v2h : std_ulogic_vector(8 downto 8);

  -- each row has an output from the addcell
  signal add_1v2h : std_ulogic_vector(3 downto 0);
  signal add_2v1h : std_ulogic_vector(7 downto 0);

  -- these are a cycle later
  signal add_3v2h : std_ulogic_vector(7 downto 4);

  -- each row gets own array. Don't try 2-dimension array.
  -- sum_x_h is the sum output of each column in row X.
  -- ca1_x_h is the carry output of each column in row X.
  -- pre_A_h is the booth encoding for row A before the gate
  -- bth_A_h is the booth encoding for row A after the gate

  -- The V2H signals are outputs from the multiplier body
  -- the S1H signals are outputs from the 1st pipeline registers
  -- the V1H signals are outputs from the 1st pipeline gates






  signal sum_0_1v2h : std_ulogic_vector(COL downto 0);
  signal car_0_1v2h : std_ulogic_vector(COL downto 0);
  signal sum_0_2v1h : std_ulogic_vector(1 downto 0);
  signal car_0_2v1h : std_ulogic;
  signal bth_pre_0_h : std_ulogic_vector(4 downto 0);
  signal bth_0_1v2h : std_ulogic_vector(4 downto 0);

  signal sum_1_1v2h : std_ulogic_vector(COL downto 0);
  signal car_1_1v2h : std_ulogic_vector(COL downto 0);
  signal sum_1_2v1h : std_ulogic_vector(1 downto 0);
  signal car_1_2v1h : std_ulogic;
  signal bth_pre_1_h : std_ulogic_vector(4 downto 0);
  signal bth_1_1v2h : std_ulogic_vector(4 downto 0);

  signal sum_2_1v2h : std_ulogic_vector(COL downto 0);
  signal car_2_1v2h : std_ulogic_vector(COL downto 0);
  signal sum_2_2v1h : std_ulogic_vector(1 downto 0);
  signal car_2_2v1h : std_ulogic;
  signal bth_pre_2_h : std_ulogic_vector(4 downto 0);
  signal bth_2_1v2h : std_ulogic_vector(4 downto 0);

  signal sum_3_1v2h : std_ulogic_vector(COL downto 0);
  signal car_3_1v2h : std_ulogic_vector(COL downto 0);
  signal sum_3_2v1h : std_ulogic_vector(COL downto 0);
  signal car_3_2v1h : std_ulogic_vector(COL downto 0);
  signal bth_pre_3_h : std_ulogic_vector(4 downto 0);
  signal bth_3_1v2h : std_ulogic_vector(4 downto 0);

  -- The V1H signals are outputs from the multiplier body
  -- the S2H signals are outputs from the 2nd pipeline registers
  -- the V2H signals are outputs from the 2nd pipeline gates

  signal sum_4_2v1h : std_ulogic_vector(COL downto 0);
  signal car_4_2v1h : std_ulogic_vector(COL downto 0);
  signal sum_4_3v2h : std_ulogic_vector(1 downto 0);
  signal car_4_3v2h : std_ulogic;
  signal bth_pre_4_h : std_ulogic_vector(4 downto 0);
  signal bth_4_2v1h : std_ulogic_vector(4 downto 0);

  signal sum_5_2v1h : std_ulogic_vector(COL downto 0);
  signal car_5_2v1h : std_ulogic_vector(COL downto 0);
  signal sum_5_3v2h : std_ulogic_vector(1 downto 0);
  signal car_5_3v2h : std_ulogic;
  signal bth_pre_5_h : std_ulogic_vector(4 downto 0);
  signal bth_5_2v1h : std_ulogic_vector(4 downto 0);






  signal sum_6_2v1h : std_ulogic_vector(COL downto 0);
  signal car_6_2v1h : std_ulogic_vector(COL downto 0);
  signal sum_6_3v2h : std_ulogic_vector(1 downto 0);
  signal car_6_3v2h : std_ulogic;
  signal bth_pre_6_h : std_ulogic_vector(4 downto 0);
  signal bth_6_2v1h : std_ulogic_vector(4 downto 0);

  signal sum_7_2v1h : std_ulogic_vector(COL downto 0);
  signal car_7_2v1h : std_ulogic_vector(COL downto 0);
  signal sum_7_3v2h : std_ulogic_vector(COL downto 0);
  signal car_7_3v2h : std_ulogic_vector(COL downto 0);
  signal bth_pre_7_h : std_ulogic_vector(4 downto 0);
  signal bth_7_2v1h : std_ulogic_vector(4 downto 0);

  -- The first 15 bits go into a full adder array.
  -- The last 17 bits go into a 42 compressor array with W()
  --
  -- These are the a_h() and b_h() inputs and the carry output
  signal fa_a_2v1h : std_ulogic_vector(7 downto 0);
  signal fa_b_2v1h : std_ulogic_vector(7 downto 0);
  signal fa_car_2v1h : std_ulogic;

  -- these feed the 24-bit CLA
  -- fa_a_3 is (32 - 8) to accommodate an extra carry bit that we do not use
  signal fa_a_3v2h : std_ulogic_vector(32 downto 8);
  signal fa_b_3v2h : std_ulogic_vector(31 downto 8);
  signal fa1_car_3v2h : std_ulogic;

  -- The carry outputs of bit 16's compressor (no longer use 4:2 compressors, but
  -- the name is the same...)
  --signal comp_ca1_3v2h: std_ulogic;
  --signal comp_ca2_3v2h: std_ulogic;

  -- b input and Carry outputs of the 42 compressor array
  -- cout_out_h is the output of the 42 compressors (since z_v2h
  -- is not inout or buffered) no longer use 42 compressors, but name is the same.
  signal comp_b_3v2h : std_ulogic_vector(15 downto 0);
  signal comp_out_3v2h: std_ulogic_vector(31 downto 0);

  -- some miscellaneous signals used to compute the overflow

  constant GND : std_ulogic := '0';
  constant VDD : std_ulogic := '1';

  -- a modified version of x_s2h to align with the times 2 needed for booth
  -- Use a tempx as the bit-sliced version then assign whole to myx_v2h




  -- A ModelSim technote said this was the way to do it....
  -- We need to pad with a "0" on the right and duplicate x_s2h(15) on left
  -- myx_v2h is also gated on ME_Q2H
  -- myx_s1h/v1h is latched/gated on MDLY_Q1H in the 2nd pipeline stage
  signal myx_1v2h : std_ulogic_vector(COL+1 downto 0);
  signal myx_2v1h : std_ulogic_vector(COL+1 downto 0);

  signal myx_3v2h: std_ulogic; -- needed in 3rd pipeline stage

  signal myy_2s1h: std_ulogic_vector(COL-1 downto 7);
  signal myy_3v2h: std_ulogic; -- needed in 3rd pipeline stage

  signal tempx : std_ulogic_vector(COL+1 downto 0);

  -- The W signal is gated in three places.
  -- W[14:0] is gated on ME_Q2H
  -- W[15] is gated on MDLY_Q1H
  -- W[31:16] is gated on MDLY_Q2H
  -- the array indices are to keep them the same as w_s2h

  signal w_1v2h : std_ulogic_vector(31 downto 0);
  signal w_2v1h : std_ulogic_vector(31 downto 15);
  signal w_3v2h : std_ulogic_vector(31 downto 15);


  -- a temp signal array for the Y input to row 0 booth encoder.
  signal y0_in : std_ulogic_vector(2 downto 0);

  -- timing signals for pipeline registers and gates
  signal me_1q2h : std_ulogic;
  signal me_2s1h : std_ulogic;
  signal me_2q1h : std_ulogic;
  signal me_3s2h : std_ulogic;
  signal me_3q2h : std_ulogic;

  -- Internally guarded RESET on PHI_2
  signal rst_q2h : std_ulogic;

  -- The 1st 8 bits of z are generated in the 2nd pipeline stage
  signal z_2v1h : std_ulogic_vector(7 downto 0);

  signal DBG_EN : std_ulogic := '0';

begin

  -- Generate the internal reset signal




  rst_q2h rst_q2h);
  dff_clk1: dffr_fall port map( D => me_2s1h, Q=> me_3s2h, CLK=> phi_1h, Rst=> rst_q2h);

  medly_s2h w_s2h(31 downto 15), Clk=> me_1q2h, Rst=> Rst_q2h);

  wlatch_2: dffrN_fall generic map(17)
    port map (Q=> w_3v2h, D=> w_2v1h(31 downto 15), Clk=> me_2q1h, Rst=> Rst_q2h);

  -- ff_h(0) is always 0
  ff_1v2h(0)



  ----------------------------------------------------------------

  ----------------------------------------------------------------
  -- 1) Generate the sign extender cells, one cell per row
  ----------------------------------------------------------------
  -- There is one sign cell per row
  COLGEN1: for i in 0 to 3 generate
    sgncell : sgn port map( pp_h => pp15_1v2h(i), ff_h => ff_1v2h(i),
                            pp_out_h => pp_1v2h(i+1), ff_out_h => ff_1v2h(i+1) );
  end generate;

  pipe_pp2: gdffr_fall port map ( Q=> pp_2v1h(4), D=> pp_1v2h(4),
                                  Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);
  pipe_ff2: gdffr_fall port map ( Q=> ff_2v1h(4), D=> ff_1v2h(4),
                                  Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

  COLGEN2: for i in 4 to 7 generate
    sgncell : sgn port map( pp_h => pp15_2v1h(i), ff_h => ff_2v1h(i),
                            pp_out_h => pp_2v1h(i+1), ff_out_h => ff_2v1h(i+1) );
  end generate;

  pipe_pp3: gdffr_fall port map ( Q=> pp_3v2h(8), D=> pp_2v1h(8),
                                  Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);
  --pipe_ff3: gdffr_fall port map ( Q=> ff_3v2h(8), D=> ff_2v1h(8),
  --                                Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);

  ----------------------------------------------------------------
  -- 2) The booth encoders, one cell per row
  ----------------------------------------------------------------
  -- Generate each Booth encoder, one per row. Note that row 0 is special
  -- and pads a "0" as LSB.
  y0_in(2 downto 1) bth_pre_0_h);
  bth_1 : booth_encode port map ( in_h => y_s2h(3 downto 1), bth_h => bth_pre_1_h);
  bth_2 : booth_encode port map ( in_h => y_s2h(5 downto 3), bth_h => bth_pre_2_h);
  bth_3 : booth_encode port map ( in_h => y_s2h(7 downto 5), bth_h => bth_pre_3_h);

  -- Delay y_s2h(15 downto 7) until stage 2

  bth_4 : booth_encode port map ( in_h => myy_2s1h(9 downto 7), bth_h => bth_pre_4_h);
  bth_5 : booth_encode port map ( in_h => myy_2s1h(11 downto 9), bth_h => bth_pre_5_h);
  bth_6 : booth_encode port map ( in_h => myy_2s1h(13 downto 11), bth_h => bth_pre_6_h);
  bth_7 : booth_encode port map ( in_h => myy_2s1h(15 downto 13), bth_h => bth_pre_7_h);
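
  -- Note on the overlapping triplets fed to the encoders above. This is a
  -- sketch that ASSUMES booth_encode implements standard radix-4 (modified)
  -- Booth recoding; the bit encoding of bth_h is not shown in this listing,
  -- so only the arithmetic meaning of each triplet is given here.
  -- Row k sees (y(2k+1), y(2k), y(2k-1)), with y(-1) = 0 for row 0, and
  -- selects a multiple of X that is weighted by 4**k:
  --   "000" ->   0      "100" -> -2X
  --   "001" ->  +X      "101" ->  -X
  --   "010" ->  +X      "110" ->  -X
  --   "011" -> +2X      "111" ->   0
  -- Worked example for Y = 6 (low bits "0110", upper bits zero):
  --   row 0 sees "100" -> -2X, row 1 sees "011" -> +2X, rows 2..7 see "000",
  --   so the partial products sum to -2X + 4*(2X) = 6X.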





  -- Pass the booth encoding through the gated drivers
  --bth_0_1v2h '0');
  --bth_1_1v2h '0');
  --bth_2_1v2h '0');
  --bth_3_1v2h '0');
  --bth_4_2v1h '0');
  --bth_5_2v1h '0');
  --bth_6_2v1h '0');
  --bth_7_2v1h '0');
  bth_0_1v2h Rst_q2h);

  ----------------------------------------------------------------
  -- 3) The add cells, one per row
  ----------------------------------------------------------------
  -- The Add Cells get mixed up on the indices, since booth encoding is
  -- not a row array. Easiest to just declare each one outside a generate loop
  addcell_0 : addcell port map ( bth => bth_0_1v2h, sum => add_1v2h(0) );
  addcell_1 : addcell port map ( bth => bth_1_1v2h, sum => add_1v2h(1) );
  addcell_2 : addcell port map ( bth => bth_2_1v2h, sum => add_1v2h(2) );
  addcell_3 : addcell port map ( bth => bth_3_1v2h, sum => add_1v2h(3) );
  addcell_4 : addcell port map ( bth => bth_4_2v1h, sum => add_2v1h(4) );
  addcell_5 : addcell port map ( bth => bth_5_2v1h, sum => add_2v1h(5) );
  addcell_6 : addcell port map ( bth => bth_6_2v1h, sum => add_2v1h(6) );
  addcell_7 : addcell port map ( bth => bth_7_2v1h, sum => add_2v1h(7) );

  -- Delay the first 4 to 2nd stage
  gadd1: gdffrN_fall generic map(4)
    port map( Q=> add_2v1h(3 downto 0), D=>add_1v2h,
              Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

  -- Delay the last 4 to 3rd stage




  gadd2: gdffrN_fall generic map(4)
    port map( Q=> add_3v2h(7 downto 4), D=>add_2v1h(7 downto 4),
              Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);

  ----------------------------------------------------------------
  -- 4) The Multiplier body, 16 columns by 8 rows
  ----------------------------------------------------------------
  -- i is the column
  ROWGEN: for i in 0 to COL generate
    -- for the PPFA cells, columns 0 to 14 use regular PPFA cells
    -- column 15 uses the PPFAPP cell which has a tap on the PP output of
    -- the mux. This is needed to do the sign extension.
    --
    -- So, ppfa_0(5), for example, would be column 5 of row 0

    -- The first 15 columns get sum/carry inputs from previous row
    -- Columns 15 and 16 get special wiring from the sign extenders
    -- Column 16 also uses the PPFAPP cells

    G0: if( i < COL-1 ) generate
      -- Row 0 is special and gets W() inputs
      ppfa_0: ppfa port map( bth => bth_0_1v2h,
          x1_h => myx_1v2h(i+1),
          x2_h => myx_1v2h(i),
          s0_h => w_1v2h(i),
          c0_h => GND,
          sum_h => sum_0_1v2h(i),
          ca1_h => car_0_1v2h(i));

      -- All other rows get s0_h from 2 columns left and
      -- c0_h from 1 column left from the previous row.

      ppfa_1: ppfa port map( bth => bth_1_1v2h,
          x1_h => myx_1v2h(i+1),
          x2_h => myx_1v2h(i),
          s0_h => sum_0_1v2h(i+2),
          c0_h => car_0_1v2h(i+1),
          sum_h => sum_1_1v2h(i),
          ca1_h => car_1_1v2h(i));

      ppfa_2: ppfa port map( bth => bth_2_1v2h,
          x1_h => myx_1v2h(i+1),
          x2_h => myx_1v2h(i),
          s0_h => sum_1_1v2h(i+2),
          c0_h => car_1_1v2h(i+1),
          sum_h => sum_2_1v2h(i),




          ca1_h => car_2_1v2h(i));

      ppfa_3: ppfa port map( bth => bth_3_1v2h,
          x1_h => myx_1v2h(i+1),
          x2_h => myx_1v2h(i),
          s0_h => sum_2_1v2h(i+2),
          c0_h => car_2_1v2h(i+1),
          sum_h => sum_3_1v2h(i),
          ca1_h => car_3_1v2h(i));

      p00_sum1 : gdffr_fall port map
        ( Q=> sum_3_2v1h(i), D=> sum_3_1v2h(i), Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

      p00_car1 : gdffr_fall port map
        ( Q=> car_3_2v1h(i), D=> car_3_1v2h(i), Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

      -- use the value before the tri-state
      p00_x1 : gdffr_fall port map
        ( Q=> myx_2v1h(i), D=> tempx(i), Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

      ppfa_4: ppfa port map( bth => bth_4_2v1h,
          x1_h => myx_2v1h(i+1),
          x2_h => myx_2v1h(i),
          s0_h => sum_3_2v1h(i+2),
          c0_h => car_3_2v1h(i+1),
          sum_h => sum_4_2v1h(i),
          ca1_h => car_4_2v1h(i));

      ppfa_5: ppfa port map( bth => bth_5_2v1h,
          x1_h => myx_2v1h(i+1),
          x2_h => myx_2v1h(i),
          s0_h => sum_4_2v1h(i+2),
          c0_h => car_4_2v1h(i+1),
          sum_h => sum_5_2v1h(i),
          ca1_h => car_5_2v1h(i));

      ppfa_6: ppfa port map( bth => bth_6_2v1h,
          x1_h => myx_2v1h(i+1),
          x2_h => myx_2v1h(i),
          s0_h => sum_5_2v1h(i+2),
          c0_h => car_5_2v1h(i+1),
          sum_h => sum_6_2v1h(i),
          ca1_h => car_6_2v1h(i));

      ppfa_7: ppfa port map( bth => bth_7_2v1h,
          x1_h => myx_2v1h(i+1),




          x2_h => myx_2v1h(i),
          s0_h => sum_6_2v1h(i+2),
          c0_h => car_6_2v1h(i+1),
          sum_h => sum_7_2v1h(i),
          ca1_h => car_7_2v1h(i));

      p00_sum2 : gdffr_fall port map
        ( Q=> sum_7_3v2h(i), D=> sum_7_2v1h(i), Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);

      p00_car2 : gdffr_fall port map
        ( Q=> car_7_3v2h(i), D=> car_7_2v1h(i), Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);

    end generate G0;

    -- In column 15, the s0_h input is the "pp" output of the sign extender
    -- pp_h() is indexed by row number.

    G15: if( i = COL-1 ) generate
      ppfa15_0: ppfa port map( bth => bth_0_1v2h,
          x1_h => myx_1v2h(i+1),
          x2_h => myx_1v2h(i),
          s0_h => GND,
          c0_h => GND,
          sum_h => sum_0_1v2h(i),
          ca1_h => car_0_1v2h(i));

      ppfa15_1: ppfa port map( bth => bth_1_1v2h,
          x1_h => myx_1v2h(i+1),
          x2_h => myx_1v2h(i),
          s0_h => pp_1v2h(1),
          c0_h => car_0_1v2h(i+1),
          sum_h => sum_1_1v2h(i),
          ca1_h => car_1_1v2h(i));

      ppfa15_2: ppfa port map( bth => bth_2_1v2h,
          x1_h => myx_1v2h(i+1),
          x2_h => myx_1v2h(i),
          s0_h => pp_1v2h(2),
          c0_h => car_1_1v2h(i+1),
          sum_h => sum_2_1v2h(i),
          ca1_h => car_2_1v2h(i));

      ppfa15_3: ppfa port map( bth => bth_3_1v2h,
          x1_h => myx_1v2h(i+1),
          x2_h => myx_1v2h(i),
          s0_h => pp_1v2h(3),




          c0_h => car_2_1v2h(i+1),
          sum_h => sum_3_1v2h(i),
          ca1_h => car_3_1v2h(i));

      p15_sum1 : gdffr_fall port map
        ( Q=> sum_3_2v1h(i), D=> sum_3_1v2h(i), Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

      p15_car1 : gdffr_fall port map
        ( Q=> car_3_2v1h(i), D=> car_3_1v2h(i), Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

      -- use value before the tri-state (don't use myx_1v2h)
      p15_x1 : gdffr_fall port map
        ( Q=> myx_2v1h(i), D=> tempx(i), Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

      ppfa15_4: ppfa port map( bth => bth_4_2v1h,
          x1_h => myx_2v1h(i+1),
          x2_h => myx_2v1h(i),
          s0_h => pp_2v1h(4),
          c0_h => car_3_2v1h(i+1),
          sum_h => sum_4_2v1h(i),
          ca1_h => car_4_2v1h(i));

      ppfa15_5: ppfa port map( bth => bth_5_2v1h,
          x1_h => myx_2v1h(i+1),
          x2_h => myx_2v1h(i),
          s0_h => pp_2v1h(5),
          c0_h => car_4_2v1h(i+1),
          sum_h => sum_5_2v1h(i),
          ca1_h => car_5_2v1h(i));

      ppfa15_6: ppfa port map( bth => bth_6_2v1h,
          x1_h => myx_2v1h(i+1),
          x2_h => myx_2v1h(i),
          s0_h => pp_2v1h(6),
          c0_h => car_5_2v1h(i+1),
          sum_h => sum_6_2v1h(i),
          ca1_h => car_6_2v1h(i));

      ppfa15_7: ppfa port map( bth => bth_7_2v1h,
          x1_h => myx_2v1h(i+1),
          x2_h => myx_2v1h(i),
          s0_h => pp_2v1h(7),
          c0_h => car_6_2v1h(i+1),
          sum_h => sum_7_2v1h(i),
          ca1_h => car_7_2v1h(i));
      p15_sum2 : gdffr_fall port map




        ( Q=> sum_7_3v2h(i), D=> sum_7_2v1h(i), Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);

      p15_car2 : gdffr_fall port map
        ( Q=> car_7_3v2h(i), D=> car_7_2v1h(i), Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);

    end generate G15;

    -- In column 16, the s0_h input is the "ff" output of the sign extender
    -- The c0_h input is 0.

    G16: if( i = COL ) generate
      ppfapp_0: ppfapp port map( bth => bth_0_1v2h,
          x1_h => myx_1v2h(i+1),
          x2_h => myx_1v2h(i),
          s0_h => GND,
          c0_h => GND,
          pp_h => pp15_1v2h(0),
          sum_h => sum_0_1v2h(i),
          ca1_h => car_0_1v2h(i));

      ppfapp_1: ppfapp port map( bth => bth_1_1v2h,
          x1_h => myx_1v2h(i+1),
          x2_h => myx_1v2h(i),
          s0_h => ff_1v2h(1),
          c0_h => GND,
          pp_h => pp15_1v2h(1),
          sum_h => sum_1_1v2h(i),
          ca1_h => car_1_1v2h(i));

      ppfapp_2: ppfapp port map( bth => bth_2_1v2h,
          x1_h => myx_1v2h(i+1),
          x2_h => myx_1v2h(i),
          s0_h => ff_1v2h(2),
          c0_h => GND,
          pp_h => pp15_1v2h(2),
          sum_h => sum_2_1v2h(i),
          ca1_h => car_2_1v2h(i));

      ppfapp_3: ppfapp port map( bth => bth_3_1v2h,
          x1_h => myx_1v2h(i+1),
          x2_h => myx_1v2h(i),
          s0_h => ff_1v2h(3),
          c0_h => GND,
          pp_h => pp15_1v2h(3),
          sum_h => sum_3_1v2h(i),
          ca1_h => car_3_1v2h(i));





      p16_sum1 : gdffr_fall port map
        ( Q=> sum_3_2v1h(i), D=> sum_3_1v2h(i), Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

      p16_car1 : gdffr_fall port map
        ( Q=> car_3_2v1h(i), D=> car_3_1v2h(i), Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

      -- don't use myx_1v2h, use tempx from before the tristate
      p16_x1 : gdffr_fall port map
        ( Q=> myx_2v1h(i), D=> tempx(i), Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

      ppfapp_4: ppfapp port map( bth => bth_4_2v1h,
          x1_h => myx_2v1h(i+1),
          x2_h => myx_2v1h(i),
          s0_h => ff_2v1h(4),
          c0_h => GND,
          pp_h => pp15_2v1h(4),
          sum_h => sum_4_2v1h(i),
          ca1_h => car_4_2v1h(i));

      ppfapp_5: ppfapp port map( bth => bth_5_2v1h,
          x1_h => myx_2v1h(i+1),
          x2_h => myx_2v1h(i),
          s0_h => ff_2v1h(5),
          c0_h => GND,
          pp_h => pp15_2v1h(5),
          sum_h => sum_5_2v1h(i),
          ca1_h => car_5_2v1h(i));

      ppfapp_6: ppfapp port map( bth => bth_6_2v1h,
          x1_h => myx_2v1h(i+1),
          x2_h => myx_2v1h(i),
          s0_h => ff_2v1h(6),
          c0_h => GND,
          pp_h => pp15_2v1h(6),
          sum_h => sum_6_2v1h(i),
          ca1_h => car_6_2v1h(i));

      ppfapp_7: ppfapp port map( bth => bth_7_2v1h,
          x1_h => myx_2v1h(i+1),
          x2_h => myx_2v1h(i),
          s0_h => ff_2v1h(7),
          c0_h => GND,
          pp_h => pp15_2v1h(7),
          sum_h => sum_7_2v1h(i),
          ca1_h => car_7_2v1h(i));




      p16_sum2 : gdffr_fall port map
        ( Q=> sum_7_3v2h(i), D=> sum_7_2v1h(i), Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);

      p16_car2 : gdffr_fall port map
        ( Q=> car_7_3v2h(i), D=> car_7_2v1h(i), Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);

    end generate G16;

  end generate;

  -- need to latch bit 17 of "myx", since that is not in the generates above
  glatch_x17 : gdffr_fall port map ( Q=> myx_2v1h(17), D=> myx_1v2h(17),
    Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

  -- need bit 16 (=x_in(15)) in 3rd pipeline stage for overflow
  glatch_x15 : gdffr_fall port map ( Q=> myx_3v2h, D=> myx_2v1h(16),
    Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);

  -- These are the tri-state latched outputs going to the adder
  gsum_0_2: gdffrN_fall generic map (2) port map ( Q=> sum_0_2v1h, D=> sum_0_1v2h(1 downto 0),
    Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);
  gsum_1_2: gdffrN_fall generic map (2) port map ( Q=> sum_1_2v1h, D=> sum_1_1v2h(1 downto 0),
    Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);
  gsum_2_2: gdffrN_fall generic map (2) port map ( Q=> sum_2_2v1h, D=> sum_2_1v2h(1 downto 0),
    Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

  gca1_0_2: gdffr_fall port map ( Q=> car_0_2v1h, D=> car_0_1v2h(0),
    Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);
  gca1_1_2: gdffr_fall port map ( Q=> car_1_2v1h, D=> car_1_1v2h(0),
    Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);
  gca1_2_2: gdffr_fall port map ( Q=> car_2_2v1h, D=> car_2_1v2h(0),
    Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

  gsum_4_3: gdffrN_fall generic map (2) port map ( Q=> sum_4_3v2h, D=> sum_4_2v1h(1 downto 0),
    Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);
  gsum_5_3: gdffrN_fall generic map (2) port map ( Q=> sum_5_3v2h, D=> sum_5_2v1h(1 downto 0),
    Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);
  gsum_6_3: gdffrN_fall generic map (2) port map ( Q=> sum_6_3v2h, D=> sum_6_2v1h(1 downto 0),
    Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);

  gca1_4_3: gdffr_fall port map ( Q=> car_4_3v2h, D=> car_4_2v1h(0),
    Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);
  gca1_5_3: gdffr_fall port map ( Q=> car_5_3v2h, D=> car_5_2v1h(0),
    Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);
  gca1_6_3: gdffr_fall port map ( Q=> car_6_3v2h, D=> car_6_2v1h(0),
    Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);



