Spring 2006 EE 5324 - VLSI Design II - © Kia Bazargan

EE 5324 – VLSI Design II

Kia Bazargan

University of Minnesota

Part II: Adders


References and Copyright

• Textbooks referenced
  [WE92] N. H. E. Weste, K. Eshraghian, “Principles of CMOS VLSI Design: A System Perspective”, Addison-Wesley, 2nd Ed., 1992.
  [Rab96] J. M. Rabaey, “Digital Integrated Circuits: A Design Perspective”, Prentice Hall, 1996.
  [Par00] B. Parhami, “Computer Arithmetic: Algorithms and Hardware Designs”, Oxford University Press, 2000.


References and Copyright (cont.)

• Slides used
  [©Hauck] © Scott A. Hauck, 1996-2000; G. Borriello, C. Ebeling, S. Burns, 1995, University of Washington
  [©Prentice Hall] © Prentice Hall 1995, © UCB 1996. Slides for [Rab96]: http://bwrc.eecs.berkeley.edu/Classes/IcBook/instructors.html
  [©Oxford U Press] © Oxford University Press, New York, 2000. Slides for [Par00], with permission from the author: http://www.ece.ucsb.edu/Faculty/Parhami/files_n_docs.htm


Outline

• One-bit adder, basic ripple-carry adder

• Carry-Lookahead adders (CLA)

• Manchester carry chain

• Carry bypass

• Carry select adder

• Brent-Kung adder


Why Adders?

• Addition: a fundamental operation
  – Basic block of most arithmetic operations
  – Address calculation
• Faster, faster and faster
• How?
  – Architectural level optimization
  – Gate-level optimization
  – Speed/area trade-off


Adding Two One-bit Operands

• One-bit Half Adder (HA): inputs A, B; outputs Sum, Cout

  Sum = A ⊕ B
  Cout = A.B

  A B | Sum Cout
  0 0 |  0   0
  0 1 |  1   0
  1 0 |  1   0
  1 1 |  0   1

• One-bit Full Adder (FA): inputs A, B, Cin; outputs Sum, Cout

  Sum = A ⊕ B ⊕ Cin
  Cout = A.B + B.Cin + A.Cin

  Cin A B | Sum Cout
   0  0 0 |  0   0
   0  0 1 |  1   0
   0  1 0 |  1   0
   0  1 1 |  0   1
   1  0 0 |  1   0
   1  0 1 |  0   1
   1  1 0 |  0   1
   1  1 1 |  1   1
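The two cells can be modeled directly from their equations. The following is a minimal behavioral sketch in Python; the function names `half_adder` and `full_adder` are illustrative and do not come from the slides.

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Return (sum, cout) for two one-bit operands."""
    return a ^ b, a & b


def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Return (sum, cout) for two one-bit operands plus a carry-in."""
    s = a ^ b ^ cin                          # Sum = A xor B xor Cin
    cout = (a & b) | (b & cin) | (a & cin)   # Cout = A.B + B.Cin + A.Cin
    return s, cout


if __name__ == "__main__":
    # Print the full-adder truth table in the same row order as the slide.
    print("Cin A B | Sum Cout")
    for cin in (0, 1):
        for a in (0, 1):
            for b in (0, 1):
                s, cout = full_adder(a, b, cin)
                print(f" {cin}  {a} {b} |  {s}   {cout}")
```

A quick sanity check is that `a + b + cin == s + 2 * cout` holds for every input combination.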


N-Bit Ripple-Carry Adder: Series of FA Cells

• To add two n-bit numbers

[Figure: chain of n FA cells; cell i takes Ai, Bi and carry Ci and produces Si and Ci+1; C0 enters at bit 0 and Cn leaves bit n-1]

• Note: adder delay = Tc * n
• Tc = Cin-to-Cout delay of one FA
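As a behavioral illustration of the chain above, the sketch below adds two n-bit integers one bit at a time, passing the carry from cell to cell. The name `ripple_carry_add` and the use of Python integers as bit vectors are assumptions made for this example only.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One FA cell: return (sum, cout)."""
    return a ^ b ^ cin, (a & b) | (b & cin) | (a & cin)


def ripple_carry_add(a: int, b: int, n: int, c0: int = 0) -> tuple[int, int]:
    """Add two n-bit integers with a chain of n FA cells.

    Returns (sum as an n-bit integer, carry-out Cn).
    """
    carry, total = c0, 0
    for i in range(n):                       # bit 0 first: the carry ripples upward
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


if __name__ == "__main__":
    print(ripple_carry_add(0b0011, 0b0101, 4))   # 3 + 5 in 4 bits
```

The loop runs n times and each iteration depends on the previous carry, which mirrors the delay = Tc * n note.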


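4-bit Ripple Carry Addition: Example

[Figure: four FA cells (bits 0 to 3) with inputs A0..A3, B0..B3, sums S0..S3 and carries C0..C4]

A=0011, B=0101, C0=0
Input pairs (B, A), bit 3 down to bit 0: 00 10 01 11

Output pairs (Sum, Cout), bit 3 down to bit 0, at each time step:

  T=0   00 00 00 00   S=0000
  T=1   00 10 10 01   S=0110
  T=2   00 10 01 01   S=0100
  T=3   00 01 01 01   S=0000
  T=4   10 01 01 01   S=1000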
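The time steps in this example can be reproduced with a simple unit-delay model. The sketch below assumes every FA takes exactly one time unit and that all outputs start at 0; both are modeling assumptions, and `ripple_trace` is a hypothetical helper name.

```python
def ripple_trace(a: int, b: int, n: int, steps: int) -> list[str]:
    """Return the sum word (MSB first) after each time step.

    Assumes each FA has a one-unit delay and all outputs are 0 at T=0.
    """
    carries = [0] * (n + 1)          # carries[i] is the carry into bit i
    trace = ["0" * n]                # T=0: nothing has been computed yet
    for _ in range(steps):
        new_carries = carries[:]
        sums = []
        for i in range(n):
            ai, bi, ci = (a >> i) & 1, (b >> i) & 1, carries[i]
            sums.append(ai ^ bi ^ ci)
            # Each cell only sees the carry its neighbor produced one step ago.
            new_carries[i + 1] = (ai & bi) | (ci & (ai | bi))
        carries = new_carries
        trace.append("".join(str(s) for s in reversed(sums)))
    return trace


if __name__ == "__main__":
    for t, word in enumerate(ripple_trace(0b0011, 0b0101, 4, 4)):
        print(f"T={t} S={word}")
```

For A=0011 and B=0101 the sum word passes through intermediate wrong values and only settles once the carry has rippled through all four cells.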


One-bit Full Adder Implementation

• Direct gate implementation

  Cout = A.B + B.Cin + A.Cin = A.B + Cin.(A+B)
  Sum = A ⊕ B ⊕ Cin

[Figure: gate-level schematic producing Sum and Cout from A, B and Cin]

32 Transistors Used

[WE92] p516


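One-Bit Full Adder: Share Logic

• An observation: almost always, sum = NOT carry
  – The only exceptions are the input combinations 000 and 111

  Cin A B | Sum Cout
   0  0 0 |  0   0
   0  0 1 |  1   0
   0  1 0 |  1   0
   0  1 1 |  0   1
   1  0 0 |  1   0
   1  0 1 |  0   1
   1  1 0 |  0   1
   1  1 1 |  1   1

  Sum = A.B.Cin + (A+B+Cin).Cout’

  – The term A.B.Cin includes 111
  – The factor (A+B+Cin) excludes 000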
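The shared-logic form of Sum, with the complemented carry in the second term, can be checked against the XOR definition over all eight input combinations. This is a small verification sketch; the function names are illustrative.

```python
from itertools import product


def carry(a: int, b: int, cin: int) -> int:
    """Cout = A.B + Cin.(A+B)."""
    return (a & b) | (cin & (a | b))


def sum_shared(a: int, b: int, cin: int) -> int:
    """Sum = A.B.Cin + (A+B+Cin).NOT(Cout), reusing the carry signal."""
    not_cout = 1 - carry(a, b, cin)
    return (a & b & cin) | ((a | b | cin) & not_cout)


if __name__ == "__main__":
    for a, b, cin in product((0, 1), repeat=3):
        assert sum_shared(a, b, cin) == a ^ b ^ cin
    print("shared-logic Sum matches A xor B xor Cin for all inputs")
```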


One-Bit Full Adder: Transistor Implementation

  Sum = A.B.C + (A+B+C).Cout’
  Cout = A.B + C.(A+B)

[Figure: transistor-level schematic of the Cout and Sum gates]

– Use inverters to get Cout and Sum
– C transistors close to output
– Cout delay: 2 inverting stages (1-stage possible?)
– Sum delay: 3 inverting stages (not an issue, though)

28 Transistors

[WE92] p517, [Rab96] p390


One-Bit Full Adder: Inverted Inputs

• An observation: invert inputs => outputs invert
• Exploit this property: get rid of the inverter on the carry critical path

  Cin A B | Sum Cout
   0  0 0 |  0   0
   0  0 1 |  1   0
   0  1 0 |  1   0
   0  1 1 |  0   1
   1  0 0 |  1   0
   1  0 1 |  0   1
   1  1 0 |  0   1
   1  1 1 |  1   1

[Figure: an FA with all inputs inverted is equivalent to an FA with all outputs inverted]
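The inverting property can be confirmed exhaustively: complementing A, B and Cin complements both Sum and Cout. The sketch below is a plain truth-table check; `inverting_property_holds` is a name chosen for this example.

```python
from itertools import product


def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Return (sum, cout)."""
    return a ^ b ^ cin, (a & b) | (cin & (a | b))


def inverting_property_holds() -> bool:
    """True if FA(not A, not B, not Cin) == (not Sum, not Cout) for all inputs."""
    for a, b, cin in product((0, 1), repeat=3):
        s, cout = full_adder(a, b, cin)
        s_inv, cout_inv = full_adder(1 - a, 1 - b, 1 - cin)
        if (s_inv, cout_inv) != (1 - s, 1 - cout):
            return False
    return True


if __name__ == "__main__":
    print(inverting_property_holds())
```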


Ripple Carry Adder: Inverting Property

• FA’ is similar to FA, but with no inverters on the outputs
  – Much faster (1-stage)
  – Disadvantage: not regular data path

[Figure: 4-bit ripple-carry chain of FA’ cells with inputs A0..A3, B0..B3, sums S0..S3 and carries C0..C4]


Summary: Ripple-Carry Adder

• Basic ripple carry: AND-OR gates
  – Area: 32 transistors (per bit position)
  – Delay: 2 stages of inverting logic (per bit position)
• Direct CMOS logic, share Cout’
  – Area: 28 transistors
  – Delay: 2 stages
• Use “inverting” property
  – Area: 27 (odd bits: 26, even bits: 28)
  – Delay: ~1 stage
• So far: transistor/logic manipulation
• Is that all we can do?!!



Carry-Lookahead Adder: Idea

• New look: carry propagation
• Idea:
  – Try to “predict” Ck earlier than Tc*k
  – Instead of passing through k stages, compute Ck separately using 1-stage CMOS logic
• Carry propagation: an example

  Bit position   7 6 5 4 3 2 1 0
  Carry          1 0 0 1 1 1 1
  A              0 1 0 0 1 1 0 1
  B            + 0 1 0 0 0 1 1 1
  Sum            1 0 0 1 0 1 0 0


Carry-Lookahead Adder (CLA): One Bit

• What happens to the propagating carry in bit position k?

  A B | Cin Cout
  0 0 |  -   0    (kill)
  0 1 |  C   C    (propagate)
  1 0 |  C   C    (propagate)
  1 1 |  -   1    (generate)

  p = A+B (or A ⊕ B)
  g = A.B

[Figure: one-bit carry circuit with 0-propagate, 1-propagate, generate and kill paths]

[Rab96] p391
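The kill/propagate/generate table translates into the recurrence Cout = g + p.Cin. The sketch below checks that both definitions of p (OR and XOR) give the same carry; `classify` and `carry_out` are illustrative names, not part of the slides.

```python
from itertools import product


def classify(a: int, b: int) -> str:
    """Name what a bit position does to an incoming carry."""
    if a & b:
        return "generate"
    if a | b:
        return "propagate"
    return "kill"


def carry_out(a: int, b: int, cin: int, use_xor: bool = False) -> int:
    """Cout = g + p.Cin, with p = A+B by default or p = A xor B."""
    g = a & b
    p = (a ^ b) if use_xor else (a | b)
    return g | (p & cin)


if __name__ == "__main__":
    for a, b in product((0, 1), repeat=2):
        print(a, b, classify(a, b))
```

Both choices of p agree because the only input where they differ is A=B=1, and there g=1 already forces Cout=1.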


CLA: Propagation Equations

• If C4=1, then either:
  – g3: generated at bit pos 3
  – g2.p3: generated at bit pos 2, propagated 3
  – g1.p2.p3: generated at bit pos 1, propagated 2,3
  – g0.p1.p2.p3: generated at bit pos 0, propagated 1,2,3
  – Cin.p0.p1.p2.p3: input carry, propagated 0,1,2,3
• C4 = g3 + g2.p3 + g1.p2.p3 + g0.p1.p2.p3 + Cin.p0.p1.p2.p3

Implement C4 as a one-stage CMOS logic: delay=1 (or is it?)
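The sum-of-products form of C4 can be checked against ordinary addition. The sketch below builds the same expression for any block width k and compares it with the true carry out of a 4-bit add; `lookahead_carry` and `pg_bits` are hypothetical helper names, and p is taken as A xor B.

```python
def pg_bits(a: int, b: int, k: int) -> tuple[list[int], list[int]]:
    """Return per-bit generate and propagate lists (index 0 = LSB)."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(k)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(k)]
    return g, p


def lookahead_carry(g: list[int], p: list[int], cin: int) -> int:
    """Carry out of a k-bit block as a flat sum of products."""
    k = len(g)
    carry = 0
    for j in range(k):                 # carry generated at bit j ...
        term = g[j]
        for m in range(j + 1, k):      # ... and propagated through bits j+1..k-1
            term &= p[m]
        carry |= term
    term = cin                         # input carry propagated through every bit
    for m in range(k):
        term &= p[m]
    return carry | term


if __name__ == "__main__":
    g, p = pg_bits(0b1101, 0b0111, 4)
    print(lookahead_carry(g, p, 0))
```

In software the products are still evaluated one after another; the point of the circuit is that all five terms are evaluated in a single gate.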


CLA: Static Logic Implementation

[Figure: static CMOS gate computing C4 from p0..p3, g0..g3 and Cin]

[©Hauck][Rab96] p405


CLA: Dynamic Logic Implementation

• Dynamic gate implementation: C4 = g3 + p3.(g2 + p2.(g1 + p1.(g0 + p0.Cin)))

[Figure: dynamic gate for C4 with inputs Cin, p0..p3 and g0..g3]

6 transistors in series

[©Hauck][WE92] p529
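The nested (factored) expression used for the dynamic gate is algebraically the same as the flat sum of products. The sketch below checks this for every combination of g, p and Cin; it says nothing about circuit delay, and the function names are illustrative.

```python
from itertools import product


def carries_nested(g: list[int], p: list[int], cin: int) -> list[int]:
    """Return [C1, ..., Ck] using the nested form C(i+1) = g_i + p_i.C_i."""
    carries, c = [], cin
    for gi, pi in zip(g, p):
        c = gi | (pi & c)
        carries.append(c)
    return carries


def c4_flat(g: list[int], p: list[int], cin: int) -> int:
    """C4 = g3 + g2.p3 + g1.p2.p3 + g0.p1.p2.p3 + Cin.p0.p1.p2.p3."""
    return (g[3]
            | (g[2] & p[3])
            | (g[1] & p[2] & p[3])
            | (g[0] & p[1] & p[2] & p[3])
            | (cin & p[0] & p[1] & p[2] & p[3]))


if __name__ == "__main__":
    for bits in product((0, 1), repeat=9):
        g, p, cin = list(bits[0:4]), list(bits[4:8]), bits[8]
        assert carries_nested(g, p, cin)[-1] == c4_flat(g, p, cin)
    print("nested and flat forms of C4 agree")
```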



• Can we reuse logic? Can we get C1, C2 and C3 from the same circuit?

[Figure: the dynamic C4 gate with candidate taps C1?, C2?, C3? on its internal nodes]

No! C1, C2 and C3 may be floating (not precharged)
Charge sharing problem

[©Hauck]



[Figure: separate dynamic gates for C1, C2, C3 and C4, each with its own Cin, p and g inputs]

[WE92] p529


CLA: Basic Block (4 Bits) Architecture

• Block of 4-bit p, g, Cout

[Figure: four p,g cells (bits 0 to 3, inputs Ai, Bi, outputs Si) pass (p0, g0)..(p3, g3) to a carry generator that takes C0 and produces C1, C2, C3 and C4]


CLA: N-Bit Architecture

• Put it all together:

[Figure: 4-bit CLA blocks in series; the carry generator for bits 0-3 takes C0 and produces C4, the carry generator for bits 4-7 takes C4 and produces C8, and so on]


CLA: 12-Bit Example

[Figure: 12-bit CLA made of three 4-bit blocks (bits 0-3, 4-7, 8-11), each with p,g cells and a carry generator; carries C0, C4, C8, C12]

Input labels in the figure (A=, B=): 01111101 01101001 11011010, with C0 = 0

  T=0   00000 00000 00000
  T=2   01001 11110 01111
  T=3   01001 00001 01111
  T=4   01011 00001 01111


Summary: Carry Lookahead Adder

• CLA compared to ripple-carry adder:
  – Faster (“4 times”?), but delay still linear (w.r.t. # of bits)
  – Larger area
    o P, G signal generation
    o Carry generation circuits
    o Carry generation ckt for each bit position (no re-use)
• Limitation: cannot go beyond 4 bits of look-ahead
  – Large p,g fan-out slows down carry generation
• Next: Manchester carry chains
  – Tries to reuse logic by pre-charging each carry position



Recap: Carry Look-Ahead

• Charge sharing problem

[Figure: the dynamic C4 gate (inputs Cin, p0..p3, g0..g3) with internal nodes C1?, C2?, C3? that are not precharged]


Manchester Carry Chain: First Shot

• Improvement over CLA: precharge internal nodes to avoid charge-sharing problem

[Figure: Manchester carry chain from Cin to C4 with pass transistors p0..p3, pull-downs g0..g3 and precharged internal nodes C1, C2, C3]

[©Hauck]

• Fastest way to do small adders
  – 6 transistors on the critical path


Manchester Carry Chain: Sizing

[Figure: RC model of the chain: nodes 1 to 6 with resistances R1..R6 and capacitances C1..C6, transistors M0..M4, discharge transistor MC, output Out]

$$t_p = 0.69 \sum_{i=1}^{N} C_i \left( \sum_{j=1}^{i} R_j \right)$$

[Plots: speed (delay normalized by 0.69RC) and area (in minimum size devices) versus k from 1 to 3.0; “k” is the sizing factor]

[© Prentice Hall]
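The delay expression for the RC chain, t_p = 0.69 times the sum over nodes of C_i times the total resistance between the driver and node i, is easy to evaluate numerically. The sketch below is a minimal calculator; `elmore_delay` is a name chosen here, and the unit R and C values are placeholders rather than data from the slides.

```python
def elmore_delay(resistances: list[float], capacitances: list[float]) -> float:
    """t_p = 0.69 * sum_i C_i * (R_1 + ... + R_i) for an RC ladder."""
    total, r_upstream = 0.0, 0.0
    for r, c in zip(resistances, capacitances):
        r_upstream += r               # resistance between the driver and node i
        total += c * r_upstream
    return 0.69 * total


if __name__ == "__main__":
    # Hypothetical uniform chain: R = C = 1 at every node.
    for n in (1, 2, 4, 6, 8):
        print(n, elmore_delay([1.0] * n, [1.0] * n))
```

With equal R and C at every node the sum becomes RC.N(N+1)/2, which is why the delay of an unbuffered chain grows quadratically with its length.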


Manchester Carry Chain: An Improvement

• Problem: Cin arrives late => move it closer to output
  – Use bypass logic:

[Figure: Manchester carry chain (Cin, p0..p3, g0..g3, C4) with a bypass path from Cin to C4 controlled by p0, p1, p2, p3]

[©Hauck]


Manchester Carry Chain: the Improvement

• Direct implementation

[Figure: chain from Cin through p0/g0..p3/g3 with intermediate carries C1, C2, C3 and output C4]

• Carry bypass circuitry

[Figure: bypass path from Cin to C4 gated by p0, p1, p2, p3]

• Advantages of the carry bypass circuitry
  – Only 5 series transistors
  – Less capacitance in internal nodes
  – Cin close to the output

[©Hauck]


Manchester Carry Chain: Summary

• Compared to CLA:
  – Smaller area
    o Pre-charge internal nodes
    o Reuse logic for intermediate carry signals
  – Cin close to the output
• Carry chain can be any length
  – Series propagate is slow (O(n^2) delay) => buffer every 4 bits
• Compact adder: good for up to 16 bits
• Using carries to compute sum slows down MCC
  – Use two carry chains: one for sum, one for carry propagation

[©Hauck]



Carry Bypass Adder: Idea

• The “bypass” idea is general
  – Not just for Manchester carry chain
  – The local carry chain could be “ripple carry adder”
• Structure
  – Could be static, dynamic, pass transistor
  – Carry and sum paths shown in different colors
  – Bypass logic determines: “pass” or “kill/generate”?

[Figure: block for bits i to i+k with Setup, Local Carry Chain and Sum stages; “Bypass?” logic selects between Ci and the local carry to produce Ci+k+1]


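Carry Bypass Adder: Cell Examples

• Static implementation, using ripple carry adder

[Figure: four FA cells forming the local carry chain, with a bypass multiplexer selected by p0.p1.p2.p3]

• Dynamic, Manchester (mux=wire!)

[Figure: Manchester chain with g0..g3 and p0..p3, plus a Cin bypass path through p0, p1, p2, p3 to C4]

[Rab96] p398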


Carry Bypass Adder: Cell Examples (cont.)

• Static (pass transistor logic), Manchester

T1=(p0.p1.p2).p3 T2=p3 T3=p0.p1.p2.p3

[Figure: pass-transistor Manchester chain from C0 to C4 with p0..p2 and g0..g3 devices and control signals T1, T2, T3]

[WE92] p531


Carry Bypass Adder: the Structure and Timing

[Figure: 16-bit adder in four blocks (bits 0-3, 4-7, 8-11, 12-15), each with Setup, Local Carry Chain and Sum stages; C0 enters at bits 0-3]

• Timing (critical path shown in different color):
  1. Setup
  2. Local carry generate/kill, MUX select line ready
  3. C0-C16 carry propagate (if applicable)

[Rab96] p.399


Carry Bypass Adder: Timing of a Sub-block

[Figure: the bits 8-11 block (Setup, Local Carry Chain, Sum), shown once in pass mode and once in local mode]

• For an intermediate stage, after setup:
  – If in pass mode
    o Local carry vector computes intermediate carries (possibly incorrectly)
    o At the same time, mux selection set to pass
    o When input carry arrives, intermediate carries might be recomputed
    o Meanwhile, input carry is sent to Cout
  – If not pass mode (assume bit 10 generates)
    o Local carry vector computes intermediate carries (bits 10, 11 correct)
    o At the same time, mux selection set to local
    o Meanwhile, output carry is sent to Cout correctly
    o When input carry arrives, intermediate carries C8 and C9 (S8, S9, S10) will be recomputed correctly


Carry Bypass Adder: Timing

[Figure: 16-bit bypass adder in four blocks (bits 0-3, 4-7, 8-11, 12-15), each with Setup, Local Carry Chain and Sum stages; C0 enters at bits 0-3]

$$\text{Delay} = t_{setup} + \max\{t_{select},\; 4 \times t_{FA}\} + 3 \times t_{mux\_pass} + 3 \times t_{FA} + t_{sum}$$
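The worst-case delay of the 16-bit, four-block bypass adder is t_setup + max{t_select, 4 t_FA} + 3 t_mux_pass + 3 t_FA + t_sum. The sketch below evaluates that expression. Generalizing the constants 4 and 3 to "block size", "blocks minus one" and "block size minus one" is an assumption made for this example, and all unit delays are placeholders.

```python
def bypass_adder_delay(n_bits: int = 16, block: int = 4,
                       t_setup: float = 1.0, t_fa: float = 1.0,
                       t_select: float = 1.0, t_mux_pass: float = 1.0,
                       t_sum: float = 1.0) -> float:
    """Worst-case delay of a carry bypass adder with equal-size blocks."""
    blocks = n_bits // block
    return (t_setup
            + max(t_select, block * t_fa)    # first block: local chain vs. select line
            + (blocks - 1) * t_mux_pass      # carry passes through the later muxes
            + (block - 1) * t_fa             # ripple inside the last block
            + t_sum)


if __name__ == "__main__":
    print(bypass_adder_delay())              # 16 bits, 4-bit blocks, unit delays
    print(bypass_adder_delay(n_bits=32))     # only the mux term grows
```

Doubling the width only adds mux-pass terms, so the delay is still linear in the number of bits but with a smaller slope than the ripple adder.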


Carry Bypass Adder: Pros and Cons

• Speed:
  – Faster than ripple adder
  – Still linear!
• Area overhead:
  – Mux (setup?)
  – Not worth it for small adders (N<8)
  – 10-20% for large adders

[Plot: propagation delay versus number of bits for ripple adder and bypass adder, with the 4..8 bit region marked on the axis]

[Rab96] p.399



Carry Select Adder: the Idea

• Similar to bypass
  – Instead of “waiting” for the input carry, “precompute” the carry output
  – Compute Ci+k for both cases Ci=0 and Ci=1
  – When Ci arrives, select the appropriate result
  – Sum computed in one step after the intermediate carry signals are ready

[Figure: k-bit block with Setup (p,g), a 0-carry propagation chain and a 1-carry propagation chain, multiplexers selected by Ci that produce Ci+k and the carry vector, then Sum Generation]

[Rab96] p.400
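A behavioral sketch of the select idea: each block computes its result for both possible incoming carries, and the real carry only picks one of them. The names `add_block` and `carry_select_add` are illustrative; the block addition uses Python's integer add as a stand-in for the two local carry chains.

```python
def add_block(a: int, b: int, width: int, cin: int) -> tuple[int, int]:
    """Add two width-bit values; return (sum bits, carry out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a: int, b: int, n: int, k: int, c0: int = 0) -> tuple[int, int]:
    """n-bit carry select addition with k-bit blocks (the last one may be shorter)."""
    result, carry = 0, c0
    for lo in range(0, n, k):
        width = min(k, n - lo)
        mask = (1 << width) - 1
        a_blk, b_blk = (a >> lo) & mask, (b >> lo) & mask
        # Both candidates are computed before the real carry is known.
        s0, c_if0 = add_block(a_blk, b_blk, width, 0)
        s1, c_if1 = add_block(a_blk, b_blk, width, 1)
        # The incoming carry only drives the multiplexer.
        s, carry = (s1, c_if1) if carry else (s0, c_if0)
        result |= s << lo
    return result, carry


if __name__ == "__main__":
    print(carry_select_add(0xBEEF, 0x1234, 16, 4))
```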


Linear Carry Select Adder: Structure

[Figure: 16-bit adder in four 4-bit blocks (bits 0-3, 4-7, 8-11, 12-15), each with Setup, 0-Carry and 1-Carry chains, a multiplexer and Sum; block carries C0, C4, C8, C12, C16]

[Rab96] p.401


Linear Carry Select Adder: Timing

[Figure: the same four-block structure (bits 0-3, 4-7, 8-11, 12-15) with the carry path C0, C4, C8, C12, C16 through the multiplexers]

Delay = 3 + 1 + 1 + 1 + 1 = 7 (16 bits)

[Rab96] p.401


Square Root Carry Select Adder: the Idea

• Later stages have to wait for the multiplexers in the earlier stages
• Why not give them bigger chunks of data to compute?
  – Balances the delay paths
  – Sub-linear delay (we will see why)


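Square Root Carry Select Adder: the Structure

• Assuming the following delays: Setup=1, carry propagate=1/bit, mux=1

  Block         Width   Carry chains ready at   Carry out
  Bits 0-1        2              3                 C2
  Bits 2-4        3              4                 C5
  Bits 5-8        4              5                 C9
  Bits 9-13       5              6                 C14
  Bits 14-19      6              7                 C20

[Figure: five blocks of increasing width with C0 entering at bits 0-1 and a Sum stage per block]

Delay from all paths = 8 (20 bits)

[Rab96] p.402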
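The balanced timing of the 20-bit example can be reproduced with a small model using the stated delays (setup=1, carry propagate=1 per bit, mux=1). The sketch assumes every block, including the first, ends in a multiplexer; `carry_select_delay` is a hypothetical helper name.

```python
def carry_select_delay(block_sizes: list[int], t_setup: int = 1,
                       t_carry: int = 1, t_mux: int = 1) -> int:
    """Time at which the final carry leaves the last multiplexer."""
    t_out = None
    for size in block_sizes:
        ready = t_setup + size * t_carry     # both local carry chains finish here
        # A mux can switch only when its data and its select input are both ready.
        t_in = ready if t_out is None else max(ready, t_out)
        t_out = t_in + t_mux
    return t_out


if __name__ == "__main__":
    print(carry_select_delay([2, 3, 4, 5, 6]))   # the 20-bit square-root layout
```

Because each block is one bit wider than the previous one, its local chains finish exactly when the previous mux output arrives, so no block waits on another.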


Square Root Carry Select Adder: Delay

• Assume
  – N-bit adder
  – P stages (delay directly depends on P)
  – First stage computes M bits

$$N = M + (M+1) + (M+2) + \dots + (M+P-1) = MP + \frac{P(P-1)}{2} = \frac{P^2}{2} + P\left(M - \frac{1}{2}\right)$$

• For M<<N (e.g. N=64, M=2)
  – The first term dominates: $N \approx P^2/2$, so $P \approx \sqrt{2N}$


Carry Select Adder: Trade-offs

• Area overhead:
  – An additional carry path and a multiplexer (not the whole adder)
  – About 30% more than a ripple-carry
• Delay
  – Sub-linear (we can beat that too!)

[Plot: delay (0 to 40) versus number of bits (0 to 60) for ripple adder, linear select and square root select]

[© Prentice Hall]



Binary Carry-Lookahead or Brent-Kung Adder

• Idea: use binary tree for carry propagation => logarithmic delay

[Figure: a function F of A0..A7 computed as a linear chain (tp ∼ N) and as a binary tree (tp ∼ log2(N))]

[© Prentice Hall]


Brent-Kung Adder

• Basic component: concatenation (•) of an MSB-side (left) group and an LSB-side (right) group

  (g, p) = (gleft, pleft) • (gright, pright)
  g = gleft + pleft.gright
  p = pleft.pright

[©Hauck]
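The concatenation operator is what makes a tree possible: it is associative, so groups can be merged in any bracketing as long as left/right order is kept. The sketch below implements the operator and checks associativity exhaustively; `combine` is a name chosen for this example.

```python
from itertools import product


def combine(left: tuple[int, int], right: tuple[int, int]) -> tuple[int, int]:
    """Concatenate an MSB-side group with an LSB-side group.

    Each group is a (g, p) pair: g = gleft + pleft.gright, p = pleft.pright.
    """
    g_l, p_l = left
    g_r, p_r = right
    return g_l | (p_l & g_r), p_l & p_r


if __name__ == "__main__":
    pairs = list(product((0, 1), repeat=2))
    for x, y, z in product(pairs, repeat=3):
        assert combine(combine(x, y), z) == combine(x, combine(y, z))
    print("concatenation is associative")
```

The operator is not commutative: swapping left and right changes which group's generate signal needs to be propagated.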


Brent-Kung Adder: Structure

• Define (Gi, Pi): generate and propagate for bits 0 through i
  – gi = Ai.Bi, pi = Ai ⊕ Bi
  – (G0, P0) = (g0, p0)
  – for i>0: (Gi, Pi) = (gi, pi) • (Gi-1, Pi-1) = (gi, pi) • (gi-1, pi-1) • . . . • (g0, p0)
• Key to Brent-Kung adder: use tree structure to perform concatenations

  7   6   5   4   3   2   1   0
   7-6     5-4     3-2     1-0
       7-4             3-0
               7-0

  C5? No! Doesn’t know about C0-3 yet!

[©Hauck]


Brent-Kung: the Complete Tree

[Figure: complete Brent-Kung tree for 8 bits with inputs (g0, p0) to (g7, p7) and outputs C0 to C7]

tadd ∼ log2(N)

[© Prentice Hall]
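The complete tree can be sketched in software as an up-sweep that builds the power-of-two groups followed by a down-sweep that fills in the remaining prefixes. This is a behavioral model only: it assumes a power-of-two width, takes p as A xor B, and the function names are illustrative.

```python
def combine(left, right):
    """(g, p) concatenation: left is the MSB-side group."""
    g_l, p_l = left
    g_r, p_r = right
    return g_l | (p_l & g_r), p_l & p_r


def brent_kung_prefix(gp):
    """Return [(G_i, P_i)] for bits 0..i, given per-bit (g_i, p_i)."""
    n = len(gp)
    assert n > 0 and (n & (n - 1)) == 0, "sketch assumes a power-of-two width"
    x = list(gp)
    d = 1
    while d < n:                               # up-sweep: 1-0, 3-2, ..., then 3-0, 7-4, ...
        for i in range(2 * d - 1, n, 2 * d):
            x[i] = combine(x[i], x[i - d])
        d *= 2
    d = n // 4
    while d >= 1:                              # down-sweep: fill in the missing prefixes
        for i in range(3 * d - 1, n, 2 * d):
            x[i] = combine(x[i], x[i - d])
        d //= 2
    return x


def brent_kung_add(a, b, n, c0=0):
    """Add two n-bit integers; return (sum, carry out)."""
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    prefix = brent_kung_prefix(gp)
    carries = [c0] + [g | (p & c0) for g, p in prefix]   # carries[i] = carry into bit i
    total = 0
    for i in range(n):
        total |= (gp[i][1] ^ carries[i]) << i            # S_i = p_i xor C_i
    return total, carries[n]


if __name__ == "__main__":
    print(brent_kung_add(0b01001101, 0b01000111, 8))
```

Each sweep has about log2(N) levels, which is where the logarithmic delay comes from; once the carries exist, every sum bit needs one more XOR.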


Brent-Kung: Timing

[Figure: 16-bit Brent-Kung network with inputs x0 to x15, outputs s0 to s15 and levels 1 to 6]

[©Oxford U Press][Par00] p.102


Brent-Kung Adder: Summary

• Area
  – On average, twice as large as ripple adder
  – Layout of the cells is very compact
• Delay
  – Logarithmic time
  – Once carry signals are ready, sum bits derived in const time
  – Good for wide adders


Comparing Adder Designs

[Plots: delay tp (sec) and area (mm2) versus number of bits (0 to 20) for static, mirror, manchester, bypass, select and Brent-Kung adders]

[© Prentice Hall]


Combining Different Adders




• Two-level carry skip adder
  – Delay = 8 cycles
  – Number of bits: 30

[Figure: two-level carry-skip adder with blocks A to F and second-level skip logic S2; Cin at t=0, Cout at t=8; block widths shown: 7, 6, 5, 4, 3, 3, 2; blocks are annotated with {Tproduce, Tassimilate} pairs {8, 1}, {7, 2}, {6, 3}, {5, 4}, {4, 5}, {3, 8}]

[Figure: 64-bit adder built from a 40-bit carry select adder (MSB side, RA(63:24), RB(63:24)) and a 24-bit differential carry lookahead adder (LSB side, RA(23:0), RB(23:0)) linked by cout23; other labels: EA(63:24), EA(23:0), real_add(40:0), hit/miss/data, TLB, Data Cache, Compare]

© Dan Stasiak, IBM Rochester, 2001



• Ripple+skip adder: delay=8. Max adder width?
  – Assume: p,g, ripple, skip signal, skipping: 1 unit delay
  – Carry signals
    o Pass mode: ready at time x through skip logic => limit # blocks
    o Local gen mode: blocks can process y bits and still have time to deliver locally generated carry by time x for the next block
  – Sum signals
    o If in local generation mode, y is OK
    o If in pass mode, y not OK for left bits (e.g., bE receives cin at x=5, can process at most z=3 bits to meet the delay bound of 8 on the sum bits)


[Figure: ripple+skip adder with blocks bA to bG and skip logic S between Cin and Cout]