Digital Integrated Circuitshtang/ece4311_doc_F11/LectureSlide/week… · © Digital Integrated...

EE141© Digital Integrated Circuits2ndArithmetic Circuits

1

Digital Integrated

CircuitsA Design Perspective

Arithmetic Circuits

Jan M. Rabaey

Anantha Chandrakasan

Borivoje Nikolic


2

A Generic Digital Processor

MEMORY

DATAPATH

CONTROL

INP

UT

-OU

TP

UT


3

Building Blocks for Digital Architectures

Arithmetic unit

- Bit-sliced datapath (adder, multiplier, shifter, comparator, etc.)

Memory

- RAM, ROM, Buffers, Shift registers

Control

- Finite state machine (PLA, random logic.)

- Counters

Interconnect

- Switches

- Arbiters

- Bus


4

Arithmetic building blocks Speed and power of arithmetic components often

dominates the overall system performance

For each module, multiple topologies and ways of

design exists, with each of them has its own advantages

A global picture is of crucial importance. A designer

focus their attention on gates or transistors that have the

largest impact on their goal function. Non-critical

components can be developed routinely.

Typically two optimization process: logic optimization

(re-arrange Boolean equations so that a faster or small

circuit could be obtained) and circuit optimization

(manipulate circuit topology and transistor sizes to

optimize speed)


5

Bit-Sliced Design

Bit 3

Bit 2

Bit 1

Bit 0

Regis

ter

Ad

der

Sh

ifte

r

Mu

ltip

lexer

Control

Data

-In

Data

-Ou

t

Tile identical processing elements

Since the same operation has to be performed on each

bit of a data word, the data path can consist of the

number of bit slices (equal to the word length), each

operating on a single bit – hence the term bit-sliced


6

Adders


7

Full-AdderA B

Cout

Sum

Cin Fulladder

Generate (G) = AB

Propagate (P) = A B

Delete (D) = A B

G,D, ensures a carry bit will be generated or deleted at Co independent of Ci,

While P guarantees that Ci will propagate to Co.


8

The Binary Adder

S A B Ci

=

A= BCi ABCi ABCi

ABCi

+ + +

Co

AB BCi

ACi

+ +=

A B

Cout

Sum

Cin Fulladder


9

The Ripple-Carry Adder

Worst case delay linear with the number of bits

Goal: Make the fastest possible carry path circuit

FA FA FA FA

A0 B0

S0

A1 B1

S1

A2 B2

S2

A3 B3

S3

Ci,0 Co,0

(= Ci,1)

Co,1 Co,2 Co,3

td = O(N)

tadder = (N-1)tcarry + tsum


10

Complimentary Static CMOS Full Adder

28 Transistors

A B

B

A

Ci

Ci A

X

VDD

VDD

A B

Ci BA

B VDD

A

B

Ci

Ci

A

B

A CiB

Co

VDD

S


11

Complimentary Static CMOS Full Adder

Large PMOS stacks are present in both carry and sum

generation circuits

Intrinsic load capacitance of Co signal is large and

consists of eight capacitance components

There is one more inverter delay for carry and sum

(worse when the load capacitance is large)

Note that critical signal Ci closer to the output node


12

Express Sum and Carry as a function of

P, G, D

Define 3 new variable which ONLY depend on A, B

Generate (G) = AB

Propagate (P) = A B

Delete (D) = A B

Can also derive expressions for S and Co based on D and P

Propagate (P) = A + B

Note that we will be sometimes using an alternate definition for


13

Transmission Gate XOR

A

B

F

B

A

B

B

M1

M2

M3/M4

tionimplementaary complementfor rs transisto12 ),( BABAF +

When B=1, M1/M2 inverter, M3/M4 off, so F=AB

When B=0, M1/M2 off, M3/M4 transmission gate, so F=AB


14

Transmission Gate Full Adder

A

B

P

Ci

VDDA

A A

VDD

Ci

A

P

AB

VDD

VDD

Ci

Ci

Co

S

Ci

P

P

P

P

P

Sum Generation

Carry Generation

SetupPropagate (P) = A B


15

Manchester Carry Chain

CoC

i

Gi

Di

Pi

Pi

VDD

CoC

i

Gi

Pi

VDD

Generate (G) = AB

Propagate (P) = A B

Delete = A B

Prevent floating Co


16

Full-AdderA B

Cout

Sum

Cin Fulladder


17


G2

C3

G3

Ci,0

P0

G1

VDD

G0

P1

P2

P3

C3

C2

C1

C0


18


Pi + 1

Gi + 1

Ci

Inverter/Sum Row

Propagate/Generate Row

Pi

Gi

Ci - 1

Ci + 1

VDD

GND

Stick Diagram


19

Manchester Carry Chain Delay for the Manchester Carry Chain can be

modeled similar to a linearized RC network as in

transmission-gates

This means the propagation delay is quadratic in the

number of bits N (but does not imply the delay will be

larger than the ripple carry adder)

It might be necessary to insert signal buffering

inverters.

Still a ripple carry adder, typically only good for small

word length (<8/16 bits)

We need faster adders for computer and multimedia

applications with word length 32-128 bits


20

Carry-Bypass Adder

FA FA FA FA

P0 G1 P0 G1 P2 G2 P3 G3

Co,3Co,2Co,1Co,0Ci ,0

FA FA FA FA

P0 G1 P0 G1 P2 G2 P3 G3

Co,2Co,1Co,0Ci,0

Co,3

Mu

ltip

lexer

BP=PoP1P2P3

Idea: If (P0 and P1 and P2 and P3 = 1)

then Co3 = C0, else “kill” or “generate”.

Also called

Carry-Skip

Break the bit-slice organization

G0

G0

P1

P1

“delete” or “generate”


21

Carry-Bypass Adder (cont.)

Carrypropagation

Setup

Bit 0–3

Sum

M bits

tsetup

tsum

Carrypropagation

Setup

Bit 4–7

Sum

tbypass

Carrypropagation

Setup

Bit 8–11

Sum

Carrypropagation

Setup

Bit 12–15

Sum

tadder = tsetup + Mtcarry + (N/M-1)tbypass + (M-1)tcarry + tsum

Tsetup: overhead time to create G, P, D signals

(worst case)


22

Carry Ripple versus Carry Bypass

(both still linear)

N

tp

ripple adder

bypass adder

4..8


23

Carry-Select Adder

Setup

"0" Carry Propagation

"1" Carry Propagation

Multiplexer

Sum Generation

Co,k-1 Co,k+3

"0"

"1"

P,G

Carry Vector


24

Carry Select Adder: Critical Path

0

1

Sum Generation

Multiplexer

1-Carry

0-Carry

Setup

Ci,0 Co,3 Co,7 Co,11 Co,15

S0–3

Bit 0–3 Bit 4–7 Bit 8–11 Bit 12–15

0

1

Sum Generation

Multiplexer

1-Carry

0-Carry

Setup

S4–7

0

1

Sum Generation

Multiplexer

1-Carry

0-Carry 0-Carry

Setup

S8–11

0

1

Sum Generation

Multiplexer

1-Carry

Setup

S12–15


25

Linear Carry Select

Setup

"0" Carry

"1" Carry

Multiplexer

Sum Generation

"0"

"1"

Setup

"0" Carry

"1" Carry

Multiplexer

Sum Generation

"0"

"1"

Setup

"0" Carry

"1" Carry

Multiplexer

Sum Generation

"0"

"1"

Setup

"0" Carry

"1" Carry

Multiplexer

Sum Generation

"0"

"1"

Bit 0-3 Bit 4-7 Bit 8-11 Bit 12-15

S0-3 S4-7 S8-11 S12-15

Ci,0

(1)

(1)

(5)(6) (7) (8)

(9)

(10)

(5) (5) (5)(5)

tadder = tsetup + Mtcarry + (N/M)tmux + tsum


26

Square Root Carry Select

Setup

"0" Carry

"1" Carry

Multiplexer

Sum Generation

"0"

"1"

Setup

"0" Carry

"1" Carry

Multiplexer

Sum Generation

"0"

"1"

Setup

"0" Carry

"1" Carry

Multiplexer

Sum Generation

"0"

"1"

Setup

"0" Carry

"1" Carry

Multiplexer

Sum Generation

"0"

"1"

Bit 0-1 Bit 2-4 Bit 5-8 Bit 9-13

S0-1 S2-4 S5-8 S9-13

Ci,0

(4) (5) (6) (7)

(1)

(1)

(3) (4) (5) (6)

Mux

Sum

S14-19

(7)

(8)

Bit 14-19

(9)

(3)

M


27

Adder Delays - Comparison

Square root select

Linear select

Ripple adder

20 40

N

t p(in

un

it d

ela

ys)

600

10

0

20

30

40

50

Bypass


28

LookAhead - Basic Idea

Co k f Ak Bk Co k 1– Gk PkCo k 1–+= =

AN-1, BN-1A1, B1

P1

S1

• • •

• • • SN-1

PN-1Ci, N-1

S0

P0Ci,0 Ci,1

A0, B0


29

Look-Ahead: Topology

Co k Gk Pk Gk 1– Pk 1– Co k 2–+ +=

Co k Gk Pk Gk 1– Pk 1– P1 G0 P0Ci 0+ + + +=

Expanding Lookahead equations:

All the way:

Co,3

Ci,0

VDD

P0

P1

P2

P3

G0

G1

G2

G3


30

Look-Ahead Adder: Logarithmic adder

A7

F

A6A5A4A3A2A1

A0

A0

A1

A2

A3

A4

A5

A6

A7

F

tp log2(N)

tp N


31

Carry Look-Ahead Trees

Can continue building the tree hierarchically.

=G1:0+P1:0C0

(G1:0=G1+P1G0 P1:0=P1P0)

C0=G0+P0Cin

C1=G1+P1C0

C2=G2+P2C1

C3=G3+P3C2

C0=G0+P0Cin

C1=G1+P1C0 =G1+G0P1+P1P0Cin

C2=G2+P2C1 =G2+G1P2+G0P2P1+P2P1P0Cin =G2:1+P2:1C0

(G2:1=G2+P2G1 P2:1=P2P1)

C3=G3+P3C2 =G3+G2P3+G1P3P2+G0P3P2P1+P3P2P1P0Cin

=G3:2+P3:2C1=G3:2+P3:2(G1:0+P1:0C0)=(G3:2+P3:2G1:0)+P3:2P1:0C0

G3:2=(G3+P3G2) and P3:2=P3P2 are called dot products.


32

Tree Adders

16-bit radix-2 Kogge-Stone tree (radix 2 means that the tree is

Binary: it combines two dot product or carry words at a time at

Each level of hierarchy)

(A0,

B0)

(A1,

B1)

(A2,

B2)

(A3,

B3)

(A4,

B4)

(A5,

B5)

(A6,

B6)

(A7,

B7)

(A8,

B8)

(A9,

B9)

(A10,

B10)

(A11,

B11)

(A12,

B12)

(A13,

B13)

(A14,

B14)

(A15,

B15)

S0

S1

S2

S3

S4

S5

S6

S7

S8

S9

S10

S11

S12

S13

S14

S15


33

Tree Adders(a

0,

b0)

(a1,

b1)

(a2,

b2)

(a3,

b3)

(a4,

b4)

(a5,

b5)

(a6,

b6)

(a7,

b7)

(a8,

b8)

(a9,

b9)

(a1

0,

b1

0)

(a1

1,

b1

1)

(a1

2,

b1

2)

(a1

3,

b1

3)

(a1

4,

b1

4)

(a1

5,

b1

5)

S0

S1

S2

S3

S4

S5

S6

S7

S8

S9

S1

0

S1

1

S1

2

S1

3

S1

4

S1

5

16-bit radix-4 Kogge-Stone Tree


34

Sparse Trees(a

0,

b0)

(a1,

b1)

(a2,

b2)

(a3,

b3)

(a4,

b4)

(a5,

b5)

(a6,

b6)

(a7,

b7)

(a8,

b8)

(a9,

b9)

(a1

0,

b1

0)

(a1

1,

b1

1)

(a1

2,

b1

2)

(a1

3,

b1

3)

(a1

4,

b1

4)

(a1

5,

b1

5)

S1

S3

S5

S7

S9

S1

1

S1

3

S1

5

S0

S2

S4

S6

S8

S1

0

S1

2

S1

4

16-bit radix-2 sparse tree with sparseness of 2


35

Tree Adders(A

0,

B0)

(A1,

B1)

(A2,

B2)

(A3,

B3)

(A4,

B4)

(A5,

B5)

(A6,

B6)

(A7,

B7)

(A8,

B8)

(A9,

B9)

(A1

0,

B1

0)

(A1

1,

B1

1)

(A1

2,

B1

2)

(A1

3,

B1

3)

(A1

4,

B1

4)

(A1

5,

B1

5)

S0

S1

S2

S3

S4

S5

S6

S7

S8

S9

S1

0

S1

1

S1

2

S1

3

S1

4

S1

5

Brent-Kung Tree


36

Intel Itanium Microprocessor

9-1

Mu

x9

-1 M

ux

5-1

Mu

x2

-1 M

ux

ck1

CARRYGEN

SUMGEN

+ LU

1000um

b

s0

s1

g64

sum sumb

LU : Logical

Unit

SU

MS

EL

a

to Cache

node1

RE

GItanium has 6 integer execution units like this


37

Bit-Sliced Design

Bit 3

Bit 2

Bit 1

Bit 0

Regis

ter

Ad

der

Sh

ifte

r

Mu

ltip

lexer

Control

Data

-In

Data

-Ou

t

Tile identical processing elements


38

Bit-Sliced Datapath

Adder stage 1

Wiring

Adder stage 2

Wiring

Adder stage 3

Bit s

lice

0

Bit s

lice

2

Bit s

lice

1

Bit s

lice

63

Sum Select

Shifter

Multiplexers

Loopback B

us

From register files / Cache / Bypass

To register files / Cache

Loopback B

us

Loopback B

us

The adder is

implemented

as a radix-4

Carry Look-

Ahead adder,

the red lines

are forwarding

the results of

different stages


39

Itanium Integer Datapath

Courtesy of Intel


40

Multipliers


41

The Binary Multiplication

Z X··

Y Zk2k

k 0=

M N 1–+

= =

Xi2i

i 0=

M 1–

Yj2j

j 0=

N 1–

=

XiYj2i j+

j 0=

N 1–

i 0=

M 1–

=

X Xi2i

i 0=

M 1–

=

Y Yj2j

j 0=

N 1–

=

with


42

The Binary Multiplication

x

+

Partial products

Multiplicand

Multiplier

Result

1 0 1 0 1 0

1 0 1 0 1 0

1 0 1 0 1 0

1 1 1 0 0 1 1 1 0

0 0 0 0 0 0

1 0 1 0 1 0

1 0 1 1


43

The Array Multiplier (4 by 4)

Y0

Y1

X3 X2 X1 X0

X3

HA

X2

FA

X1

FA

X0

HA

Y2X3

FA

X2

FA

X1

FA

X0

HA

Z1

Z3Z6Z7 Z5 Z4

Y3X3

FA

X2

FA

X1

FA

X0

HA

Z2

Z0

The carryout of the last adder for Yi is forwarded to Yi+1

carry

sum

Half

adder


44

The MxN Array Multiplier

— Critical Path

HA FA FA HA

HAFAFAFA

FAFA FA HA

Critical Path 1

Critical Path 2

Critical Path 1 & 2


45

Carry-Save Multiplier

HA HA HA HA

FAFAFAHA

FAHA FA FA

FAHA FA HA

Vector Merging Adder

A more efficient

realization can be obtained

by noticing that the

multiplication results does

not change when the output

carry bits are passed

diagonally downwards

instead of to the right.

But need extra adders

(vector merging adders) that

can use fast carry look

ahead adders (since results

come at the same time)

Critical path is uniquely

defined


46

Multiplier Floorplan

SCSCSCSC

SCSCSCSC

SCSCSCSC

SC

SC

SC

SC

Z0

Z1

Z2

Z3Z4Z5Z6Z7

X0X1X2X3

Y1

Y2

Y3

Y0

Vector Merging Cell

HA Multiplier Cell

FA Multiplier Cell

X and Y signals are broadcasted

through the complete array.

( )


47

Wallace-Tree Multiplier

6 5 4 3 2 1 0 6 5 4 3 2 1 0

Partial products First stage

Bit position

6 5 4 3 2 1 0 6 5 4 3 2 1 0

Second stage Final adder

FA HA

(a) (b)

(c) (d)

Save the number of full adders

Increase the complexity of routing


48


HA

Can use carry Look-Ahead adder for the last stage


49


FA

FA

FA

FA

y0 y1 y2

y3

y4

y5

S

Ci-1

Ci-1

Ci-1

Ci

Ci

Ci

FA

y0 y1 y2

FA

y3 y4 y5

FA

FA

CC S

Ci-1

Ci-1

Ci-1

Ci

Ci

Ci


50

Booth encoding Multiply by 01111110 gives 8 partial products, but two

are all zero. Add these zero is waste of time.

Instead, multiply by 100000010, where 1 stands for -1.

Then you need to only add (actually subtract) partial

products, which improves speed

This kind of transformation is called booth encoding. It

reduces the number of partial product to at most half of

the original multiplier width.

The encoding logic is easily incorporated in the overall

multiplier design.


51

Multipliers —Summary

• Optimization Goals Different Vs Binary Adder

• Once Again: Identify Critical Path

• Other possible techniques

- Data encoding (Booth)

- Pipelining

FIRST GLIMPSE AT SYSTEM LEVEL OPTIMIZATION

- Logarithmic versus Linear (Wallace Tree Mult)

This is also why algorithmic invention has significant

meaning to VLSI design.


52

Shifters


53

The Binary Shifter

Ai

Ai-1

Bi

Bi-1

Right Leftnop

Bit-Slice i

...


54

The Barrel Shifter

Sh3Sh2Sh1Sh0

Sh3

Sh2

Sh1

A3

A2

A1

A0

B3

B2

B1

B0

: Control Wire

: Data Wire

Area Dominated by Wiring

Column: maximum shift

Word length


55

4x4 barrel shifter

BufferSh3Sh2Sh1Sh0

A3

A2

A1

A0

Coder/decoder required to set shift bits

Signal pass through one gate independent of shift

amount (parasitic capacitance may change the picture)


56

Logarithmic ShifterSh1 Sh1 Sh2 Sh2 Sh4 Sh4

A3

A2

A1

A0

B1

B0

B2

B3

No separate coder/decoder is required


57

A3

A2

A1

A0

Out3

Out2

Out1

Out0

0-7 bit Logarithmic Shifter

Good for large shift amount (note that cascade pass

transistor slow down the gate and generate weak signals,

buffers may be needed)


58

Building Blocks for Digital Architectures

Arithmetic unit

- Bit-sliced datapath (adder, multiplier, shifter, comparator)

(comparator, divider, sin, cos etc)

Digital Integrated Circuitshtang/ece4311_doc_F11/LectureSlide/week… · © Digital Integrated...

Documents

Transcript of Digital Integrated Circuitshtang/ece4311_doc_F11/LectureSlide/week… · © Digital Integrated...