Speed and Power Trade-offs: Applied to Adder Design
Vojin G. Oklobdzija, Ram Krishnamurthy
Intel AMR / ACSEL Laboratory
Intel Corp / University of California Davis
www.ece.ucdavis.edu/acsel
From: Tutorial Presentation, 16th International Symposium on Computer Arithmetic
Santiago de Compostela, SPAIN
June 18, 2003
June 18, 2003 16th International Symposium on Computer Arithmetic, Santiago de Compostela, SPAIN
Issues to be addressed
• How do we compare different topologies for their efficiency?
• How do we estimate the speed and efficiency of our algorithm?
• What criteria should we use when developing a new algorithm?
• How does power enter into this equation?
Additional Issues
• Determine which topology is the best for a given power or delay budget
• Determine which topology can stretch the furthest in terms of speed or power
Metric
Previously used estimates
Counting the number of gates (logic levels): not accurate
[Figure: carry-lookahead adder built from individual adders generating g_i, p_i, and sum S_i; 4-bit carry-lookahead blocks generating G_i, P_i, and C_in for the adders; carry-lookahead super-blocks of 4-bit blocks generating G*_i, P*_i, and C_in for the 4-bit blocks; and a group producing the final carry C_out and C_16.]
Critical path delay = 1Δ (for g_i, p_i) + 2×2Δ (for G, P) + 3×2Δ (for C_in) + 1 XOR-Δ (for Sum) = approx. 12Δ of delay
Critical path in Motorola's 64-bit CLA
Critical path: A, B - G0 - G3:0 - G15:0 - G47:0 - C48 - C60 - C63 - S63
[Figure: block diagram of the 64-bit CLA, with PG blocks and carry blocks combining bit-level P, G signals into group signals (G3:0, P3:0 up to G63:0, P63:0) and producing carries C0 through C64.]
Motorola's 64-bit CLA: Modified PG Block
Intermediate propagate signals Pi:0 are generated to speed up C3

Fan-In and Fan-Out Dependency (Oklobdzija, Barnes: IBM 1985)

Delay Comparison: Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Delay Complexity
Design Objective
• Design takes time:
– finding results afterward is not of much value
• There is a disconnect between the measures used in computer arithmetic when developing an algorithm and what is obtained after implementation
– we want estimates as close as possible to the measured results
• A simple tool that can evaluate different design trade-offs for a given technology is needed
• Power trade-off is the most important
– speed and power are tradable
Logical Effort Theory
• “Back of the Envelope” complexity: good for estimating speed
• Gate delay = linear function of load
– Slope: logical effort (gate driving characteristics)
– Intercept: parasitic delay (gate internal load)
• “Logical Effort” accuracy is not sufficient
– We needed to extend and refine the method
– However, that becomes more than “Back of the Envelope”
• Logical Effort does not account for possible power-delay trade-offs
Logical Effort Theory
• Excel: a platform of choice (ARITH-16)
– Simple enough
– Can provide computation quickly
– Easy to enter a given design
• Technology characterization is needed:
– This needs to be done only once: available for every design afterwards
– Domino gate = 2 stages, dynamic and static
• Different driving characteristics of these stages
• Multi-output gates (carry-look-ahead, Ling/conditional sum)
• Energy model needs to be included
Energy Motivation
• AGUs: performance and peak-current limiters
• High activity: thermal hotspot
• Goal: high-performance energy-efficient design
[Figure: processor thermal map showing the cache, execution core, and AGU, with a temperature scale (°C) reaching 120°C. Courtesy of Intel Corp.]
Critical Paths of Representative 64-bit Adders

Kogge-Stone Adder
• Critical path = PG + 5 + XOR = 7 gate stages
• Generate, Propagate fanout of 2, 3
• Maximum interconnect spans 16b
• Energy inefficient
[Figure: Kogge-Stone tree over bits 0 to 31 with a PG stage, carry-merge gates, and an XOR stage.]
Sparse-tree Adder Architecture
• Generate every 4th carry in parallel
• Side-path: 4-bit conditional sum generator
• 73% fewer carry-merge gates: energy-efficient
[Figure: sparse-tree adder over bits 0 to 31 generating carries C3, C7, C11, C15, C19, C23, and C27.]
Kogge-Stone adder (8-stage)

Stage  Logical Effort (G)  Branch Effort (B)  Int. Pitch (C)  Effective Branch Effort (B+I·C)  Parasitic Comp.
PG     0.6                 2                  1               2.1                              1.3
CM0    1.48                2                  2               2.2                              2.5
CM1    0.59                2                  4               2.4                              1.6
CM2    1.48                2                  8               2.8                              2.5
CM3    0.59                2                  16              3.6                              1.6
CM4    1.48                1                  0               1.0                              2.5
XOR    1.69                1                  0               1.0                              3.0
Inv    1                   1                  0               1.0                              1.0

Path Branch Effort = ΠBi = 108.92
Path Logical Effort = ΠGi = 1.14
Path Effort = 124.63
Path Delay (ps) = 93.97

Design Parameters
Adder Pitch (um)                    10
Interconnect Cap (fF/um)            0.157
Gate Cap (fF/um)                    1.15
Avg. input cap per gate (um)        14
% int. to gate cap per pitch, I     10%
Inv. L.E.                           2.24
Parasitic delay                     3.8

$D = 8\,(GBH)^{1/8} \cdot 2.2 + 3.8\,P$
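The spreadsheet arithmetic behind the Kogge-Stone table can be sketched in a few lines of Python. This is a minimal sketch, not the authors' Excel sheet: it assumes that I is the wire capacitance of one bit pitch divided by the average gate input capacitance (which gives about 9.75%, shown as 10% on the slide) and that the path electrical effort H is 1, since the slide does not state H. The function and variable names are illustrative.

```python
import math

# Rows copied from the Kogge-Stone table above:
# (stage, logical effort G, branch effort B, interconnect pitches C, parasitic)
STAGES = [
    ("PG",  0.60, 2, 1,  1.3),
    ("CM0", 1.48, 2, 2,  2.5),
    ("CM1", 0.59, 2, 4,  1.6),
    ("CM2", 1.48, 2, 8,  2.5),
    ("CM3", 0.59, 2, 16, 1.6),
    ("CM4", 1.48, 1, 0,  2.5),
    ("XOR", 1.69, 1, 0,  3.0),
    ("Inv", 1.00, 1, 0,  1.0),
]


def interconnect_ratio(pitch_um=10.0, wire_ff_per_um=0.157,
                       gate_ff_per_um=1.15, avg_input_um=14.0):
    """Assumed definition of I: wire cap of one bit pitch / average gate input cap."""
    return (wire_ff_per_um * pitch_um) / (gate_ff_per_um * avg_input_um)


def kogge_stone_estimate(h_path=1.0, inv_effort_ps=2.2, inv_parasitic_ps=3.8):
    """Evaluate D = N * (GBH)^(1/N) * 2.2 + 3.8 * P for the table above.

    h_path = 1.0 is an assumption; the slide does not give H.
    """
    i = interconnect_ratio()
    n = len(STAGES)
    g_path = math.prod(g for _, g, _, _, _ in STAGES)
    # Effective branch effort per stage is B + I*C.
    b_path = math.prod(b + i * c for _, _, b, c, _ in STAGES)
    p_total = sum(p for _, _, _, _, p in STAGES)
    f_path = g_path * b_path * h_path
    delay_ps = n * f_path ** (1.0 / n) * inv_effort_ps + inv_parasitic_ps * p_total
    return {"I": i, "G": g_path, "B": b_path, "F": f_path,
            "P": p_total, "delay_ps": delay_ps}


if __name__ == "__main__":
    for key, value in kogge_stone_estimate().items():
        print(f"{key}: {value:.2f}")
```

Under these assumptions the path branch effort, path logical effort, and path effort come out at the slide's 108.92, 1.14, and 124.63. The delay lands about 1 ps below the slide's 93.97 ps, presumably because the tabulated parasitics are rounded.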
MXA2 – Architecture & Result
• Multiplexer-based
• Generate carries using radix-2 (P,G)
• 4-bit conditional sum selected by carries
• 4-b cell width = 17 µm
• 9-stage critical path
– Per-stage effort = 3.7
– Total effort delay = 33.3
– Total parasitic = 22.5
– Total delay = 55.8
[Figure: MXA2 block diagram with sixteen PG4 and S4 blocks covering bits 0..3 through 60..63, plus multiplexer-based PG group and 4-bit sum circuits (Sum0 to Sum3, Cin).]
HC2 – Architecture
• Generate even carries using radix-2 (P,G)
• Generate odd carries from even carries
• CMOS adder for sum
• 1-b cell width 4 µm
• 10-stage critical path
[Figure: HC2 critical path through the (p,g) stage, carry-merge stages CM0 to CM6 (NAND2, NOR2, AOI, OAI gates), and the XOR2 sum stage, with even-bit and odd-bit paths; tree diagram over bits 0 to 63 with levels L1 to L6, Odd, and Sum.]
HC2 – Circuits & Results
[Figure: circuit schematics for the g/p, G, P, and Sum cells, in static and clocked (CK) versions.]

         Per-Stage Effort  Total Effort Delay  Total Parasitic  Total Delay
Static   2.8τ              28.0τ               34.5τ            62.5τ
KS2 – Architecture & Results
• Generate carries using radix-2 (P,G)
• CMOS adder for sum
• Similar circuits as HC2
• 1-b cell width 4 µm
• 9-stage critical path

         Per-Stage Effort  Total Effort Delay  Total Parasitic  Total Delay
Static   3.0τ              27.0τ               30.6τ            57.6τ
Dynamic  2.11τ             19.0τ               23.6τ            42.6τ
[Figure: KS2 tree diagram over bits 0 to 63 with levels L1 to L6, Inv, and Sum.]
KS4 – Architecture
• Generate carries using redundant radix-4 (P,G)
• Dynamic circuit
• 1-b cell width 4 µm
• 6-stage critical path
[Figure: KS4 tree over bits 0 to 63, with G4/P4 and G16/P16 stages followed by Co and Sum.]
KS4 – Circuits & Result
[Figure: dynamic circuit schematics for the G4, P4, G16, P16, and Sum cells.]

         Per-Stage Effort  Total Effort Delay  Total Parasitic  Total Delay
Dynamic  2.3τ              13.8τ               16.3τ            30.1τ
CLA4 – Architecture
• Generate carries using radix-4 (P,G,C)
• 1-b cell width 4 µm
• 15-stage critical path
[Figure: (P,G,C) network of PGC blocks over bits b0 to b63, with G-path and P-path, producing carries C4 through C60 from Cin = C0.]
CLA4 – Circuits & Result
[Figure: dynamic circuit schematics for the G, P, K and Sum cells, and a 4-bit PGC block with inputs G0..G3, P0..P3, C0 and outputs G1:0..G3:0, P1:0..P3:0, C1..C3.]

         Per-Stage Effort  Total Effort Delay  Total Parasitic  Total Delay
Dynamic  1.4τ              21.0τ               33.3τ            54.3τ
LNG4 – Architecture
• Generate carries using Ling pseudo-carries
• Conditional sums selected by local & long carries
• 1-b cell width 5.1 µm; 9-stage critical path
LNG4 – Circuits & Result
[Figure: dynamic circuit schematics for the G3, G4, P4, and LC cells and the conditional-sum cells (SumH, SumL) selected by local and long carries.]

         Per-Stage Effort  Total Effort Delay  Total Parasitic  Total Delay
Dynamic  2.4τ              21.6τ               22.3τ            43.9τ
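The per-adder results on the preceding slides all follow the same pattern: total delay = stages × per-stage effort + total parasitic. The sketch below collects the figures quoted on those slides (in units of τ) and recomputes the totals; the data layout and function names are illustrative, not part of the original deck.

```python
# (adder, style, stages N, per-stage effort f, total parasitic P), in units of tau,
# copied from the MXA2, HC2, KS2, KS4, CLA4 and LNG4 result slides.
ADDERS = [
    ("MXA2", "static",  9,  3.7,  22.5),
    ("HC2",  "static",  10, 2.8,  34.5),
    ("KS2",  "static",  9,  3.0,  30.6),
    ("KS2",  "dynamic", 9,  2.11, 23.6),
    ("KS4",  "dynamic", 6,  2.3,  16.3),
    ("CLA4", "dynamic", 15, 1.4,  33.3),
    ("LNG4", "dynamic", 9,  2.4,  22.3),
]


def total_delay(n_stages, stage_effort, parasitic):
    """Total delay = N * f + P (effort delay plus parasitic delay)."""
    return n_stages * stage_effort + parasitic


def ranked_adders():
    """Return (adder, style, total delay) tuples, fastest first."""
    rows = [(name, style, round(total_delay(n, f, p), 1))
            for name, style, n, f, p in ADDERS]
    return sorted(rows, key=lambda row: row[2])


if __name__ == "__main__":
    for name, style, delay in ranked_adders():
        print(f"{name:5s} {style:8s} {delay:5.1f} tau")
```

The recomputed totals agree with the "Total Delay" column of each slide, which is a quick consistency check on the transcribed numbers.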
Results from Simulation
• Fairly consistent with logical effort analysis
• Per-stage delay
– 1.4 FO4 (static)
– 0.8 FO4 (dynamic)

Type     Adder  # Stages  LE (FO4)  SPICE (FO4)  Diff (FO4)
Static   KS2    9         11.8      10.9         -0.88
Static   MX2    9         11.4      12.8         1.41
Static   HC2    10        12.8      13.3         0.46
Dynamic  KS4    6         6.2       7.4          1.27
Dynamic  KS2    9         8.7       9.2          0.44
Dynamic  LNG4   9         9.0       9.5          0.51
Dynamic  HC2    10        9.8       9.9          0.08
Dynamic  CLA4   16        11.4      14.2         2.74
[Figure: bar chart of HSPICE delay and difference (FO4) for each adder.]
Delay of Representative 64-b Adders
[Figure: bar chart of total delay (FO4) for MXA2, HC2, KS2, QTA2, KS4, and LNG4, with static and dynamic implementations.]
What happens when power is considered?
[Figure: energy vs. delay curves for Adder A and Adder B, divided into Region 1 and Region 2. Design points A, B, A’, B’, A”, B” mark the speed of A and the speed of B; annotations: "A is faster", "Less power", "Point where B becomes better than A".]
• With a better E-D trade-off, B can achieve more speed with less power than A
• Must look at the Energy-Delay space of designs
Energy-Delay Space
[Figure: energy vs. delay curves for different adders; labels: Emin, Dmin, "speed barrier", "power limit".]
Logical Effort in Energy-Delay Space
[Figure: energy vs. total delay curve with the LE point marked, running from lower stage-effort to higher stage-effort; annotation: "Most design approaches focus here".]
• Is it possible to lower energy by trading delay? or …
Logical Effort

Delay in a Logic Gate
Delay of a logic gate has two components:
d = f + p
– f: effort delay (stage effort)
– p: parasitic delay
f = gh
– g: logical effort
– h: electrical effort = Cout/Cin
• Logical effort describes the relative ability of a gate topology to deliver current (defined to be 1 for an inverter)
• Electrical effort is the ratio of output to input capacitance; it is also called “fanout”
*from Mathew Sanu / D. Harris
Logical Effort Parameters: Inverter
• d = gh + p
• Delay increases linearly with fanout
• More complex gates have greater g and p
[Figure: inverter delay vs. fanout h = Cout/Cin (0 to 6), a straight line d = gh + p with g = 2.2 (logical effort) and p = 3.8 ps (parasitic delay).]
*from Mathew Sanu / D. Harris
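As a small worked example of d = gh + p, the sketch below evaluates the inverter line using the two values annotated on the plot (g = 2.2, p = 3.8 ps). Treating g as a slope in ps per unit of fanout is an assumption based on the plot's annotations; the function names are illustrative.

```python
INV_G_PS = 2.2  # slope annotated on the inverter plot (assumed ps per unit fanout)
INV_P_PS = 3.8  # parasitic delay annotated on the inverter plot, in ps


def electrical_effort(c_out, c_in):
    """Electrical effort ("fanout") h = Cout / Cin."""
    return c_out / c_in


def gate_delay(g, h, p):
    """Gate delay d = f + p, with effort delay f = g * h."""
    return g * h + p


def inverter_delay_table(max_fanout=6):
    """Inverter delay in ps for integer fanouts 0..max_fanout."""
    return [(h, gate_delay(INV_G_PS, h, INV_P_PS)) for h in range(max_fanout + 1)]


if __name__ == "__main__":
    for h, d in inverter_delay_table():
        print(f"h = {h}: d = {d:.1f} ps")
```

With these values a fanout-of-4 inverter comes out at 2.2 × 4 + 3.8 = 12.6 ps; in normalized units (g = p = 1) the same expression gives the familiar h + 1.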
Normalized Logical Effort: Inverter
• Define delay of unloaded inverter = 1
• Define logical effort ‘g’ of inverter = 1
• Delay of complex gates can be defined w.r.t. d = 1
[Figure: normalized delay d vs. fanout h = Cout/Cin (1 to 5) for an inverter, split into parasitic delay and effort delay; for the inverter g = 1, p = 1, d = gh + p = h + 1.]
*from Mathew Sanu / D. Harris
Computing Logical Effort
DEF: Logical effort is the ratio of the input capacitance of a gate to the input capacitance of an inverter delivering the same output current
• Measured from delay vs. fanout plots of simulated gates
• Or estimated, counting capacitance in units of transistor W
*from Mathew Sanu / D. Harris
L.E. for Adder Gates
[Figure: delay (ps) vs. fanout (0 to 6) for Inverter, Static CM, Dyn PG, Dyn CM, and Mux.]
• Logical effort parameters obtained from simulation for std cells
• Define logical effort ‘g’ of inverter = 1
• Delay of complex gates can be defined w.r.t. d = 1
*from Mathew Sanu / D. Harris
Normalized L.E.
• Logical effort & parasitic delay normalized to that of inverter

Gate type    Logical Eff. (g)  Parasitics (Pinv)
Inverter     1                 1
Dyn. Nand    0.6               1.34
Dyn. CM      0.6               1.62
Dyn. CM-4N   1                 3.71
Static CM    1.48              2.53
Mux          1.68              2.93
XOR          1.69              2.97
*from Mathew Sanu
Delay of a string of gates
• Delay of a path: $D = \sum d_i = \sum g_i h_i + \sum p_i$
• $g_i$ & $p_i$ are constants
• To minimize path delay, optimal values of $h_i$ are to be determined
• D is minimized when each stage bears the same effort, i.e. $g_i h_i = g_{i+1} h_{i+1}$
*from Mathew Sanu / D. Harris
Minimizing path delay
• Logical Effort of a string of gates: $G = \prod g_i$
• Path Electrical Effort: $H = \prod h_i = C_{out(path)} / C_{in(path)}$
• Branching Effort: $b = (C_{on\text{-}path} + C_{off\text{-}path}) / C_{on\text{-}path}$
• Path Branching Effort: $B = \prod b_i$
• Path Effort: $F = GBH$
Delay is minimized when each stage bears the same effort:
$f = g_i h_i = F^{1/N}$
The minimum delay of an N-stage path is: $N F^{1/N} + P$
*from Mathew Sanu / D. Harris
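The path-effort recipe above (F = GBH, equal stage effort f = F^(1/N), minimum delay N·F^(1/N) + P) is easy to check numerically. The following is a minimal sketch in normalized units; the function names and the three-inverter example are illustrative and not taken from the slides.

```python
import math


def branching_effort(c_on_path, c_off_path):
    """b = (C_on-path + C_off-path) / C_on-path."""
    return (c_on_path + c_off_path) / c_on_path


def min_path_delay(g_list, b_list, h_path, p_list):
    """Return (F, f_opt, D_min) for an N-stage path.

    g_list: logical effort of each stage
    b_list: branching effort at each stage output
    h_path: path electrical effort, C_out(path) / C_in(path)
    p_list: parasitic delay of each stage
    """
    n = len(g_list)
    path_effort = math.prod(g_list) * math.prod(b_list) * h_path  # F = G * B * H
    stage_effort = path_effort ** (1.0 / n)                       # f = F^(1/N)
    return path_effort, stage_effort, n * stage_effort + sum(p_list)


if __name__ == "__main__":
    # Illustrative case: three inverters (g = p = 1, no branching) driving 64x the input cap.
    F, f, D = min_path_delay([1, 1, 1], [1, 1, 1], 64, [1, 1, 1])
    print(f"F = {F:.1f}, stage effort = {f:.2f}, minimum delay = {D:.2f}")
```

For the illustrative three-inverter chain the stage effort is 4 and the minimum delay is 3 × 4 + 3 = 15 in normalized units.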
Inclusion of Wire Delay into Logical Effort

Wiring Load
• Wiring in hand analysis
– Only lumped capacitance included
• Wiring in HSPICE
– Short wire: 1-segment π-model RC network
– Long wire: 4-segment π-model RC network
– Using worst-case wire capacitance
• Wire length
– Estimated from most critical 1-bit pitch
Modeling interconnect cap.
• Include interconnect cap in branching factor
Without interconnect: $b = (C_{on\text{-}path} + C_{off\text{-}path}) / C_{on\text{-}path} = 2$
With interconnect: $b = (C_{on\text{-}path} + C_{off\text{-}path} + C_{int}) / C_{on\text{-}path} = 2 + C_{int}/C_{on\text{-}path} = 2 + I$
I: % int. cap to gate cap in 1 adder bit pitch
[Figure: PG gate driving two CM0 gates (on-path and off-path) across one adder bit pitch, shown without and with the interconnect capacitance Cint.]
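The effect of folding wire capacitance into the branching factor can be sketched as follows. The helper names are illustrative; the second function assumes, as the Kogge-Stone table's "Effective Branch Effort (B+I·C)" column suggests, that a wire spanning C bit pitches adds I·C to the gate-only branching factor.

```python
def branching_with_wire(c_on_path, c_off_path, c_int=0.0):
    """b = (C_on-path + C_off-path + C_int) / C_on-path."""
    return (c_on_path + c_off_path + c_int) / c_on_path


def effective_branching(gate_branching, wire_ratio, pitches):
    """Assumed form B + I*C: gate-only branching plus wire ratio I times pitches spanned."""
    return gate_branching + wire_ratio * pitches


if __name__ == "__main__":
    # Two equal gate loads and a wire worth 10% of one gate load: b = 2 + I = 2.1.
    print(branching_with_wire(1.0, 1.0, 0.1))
    # A wire spanning 16 bit pitches with I = 10% raises the branching factor to 3.6.
    print(effective_branching(2, 0.10, 16))
```

With two equal gate loads the first call reduces to the slide's b = 2 + I, and the second reproduces the 3.6 listed for stage CM3 in the Kogge-Stone table.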
Branching
[Figure: input CIN branching into two paths: stages g0, g1 (stage efforts f0, f1) driving COUT1, and stages g2, g3 (stage efforts f2, f3) driving COUT2.]
Logical Effort assumes the “branching” factor of this circuit to be 2. This is incorrect and can create inaccuracies.

Correction on Branching
f0 = f1, f2 = f3
Td1 = (f0 + f1 + parasitics), Td2 = (f2 + f3 + parasitics)
Minimum delay occurs when Td1 = Td2
“Real” Branching Calculation
$F_1 = g_0 g_1 C_{out1} / C_{in}$, $F_2 = g_2 g_3 C_{out2} / C_{in}$
$B_1 = (F_1 + F_2) / F_1 = (g_0 g_1 C_{out1} + g_2 g_3 C_{out2}) / (g_0 g_1 C_{out1})$
$B_2 = (F_1 + F_2) / F_2 = (g_0 g_1 C_{out1} + g_2 g_3 C_{out2}) / (g_2 g_3 C_{out2})$
Branching only equals 2 when: $g_0 g_1 C_{out1} = g_2 g_3 C_{out2}$
This explains why we had to resort to Excel!
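The branching correction can be illustrated numerically. The sketch below rests on one reading of the slide's formulas, namely that the branch seen from path 1 is B1 = (F1 + F2)/F1 with F1 proportional to g0·g1·Cout1 and F2 proportional to g2·g3·Cout2 (the common Cin cancels). Both that reading and the function name are assumptions.

```python
def real_branching(g0, g1, c_out1, g2, g3, c_out2):
    """Return (B1, B2) for two 2-stage branches sharing one input node.

    Assumes each branch's weight is the product of its logical efforts and
    its load, so that B1 = (w1 + w2) / w1 and B2 = (w1 + w2) / w2.
    """
    w1 = g0 * g1 * c_out1
    w2 = g2 * g3 * c_out2
    return (w1 + w2) / w1, (w1 + w2) / w2


if __name__ == "__main__":
    print(real_branching(1, 1, 10, 1, 1, 10))  # balanced branches
    print(real_branching(1, 1, 10, 1, 1, 30))  # second branch drives a 3x larger load
```

When the two branches are balanced both factors equal 2, as plain Logical Effort assumes. With a 3x heavier second branch the factors become 4 and 4/3, so the fixed value of 2 misestimates both paths.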
Technology Characterization

Characterization Setup
• Logical Effort requirements:
– Equalize input and output transitions.
• Logical Effort is characterized by varying the h (Cout/Cin) of a gate. By using a variable load of inverters, each gate can be characterized over the same range of loads.
• The Logical Effort of each gate is characterized for each input.
• Energy is characterized for each output transition of the gate caused by each input transition, i.e. for an inverter, energy is measured for tLH and tHL.
LE Characterization Setup for Static Gates
[Figure: input In driving a chain of four gates followed by a variable load; measured quantities: tLH, tHL, average, energy.]

LE Characterization Setup for Dynamic Gates
[Figure: input In driving a chain of two gates followed by a variable load; measured quantities: tHL, energy.]
LE Table (Static CMOS)
• Technology: P/N Ratio = 2; INV: g = 3.67 ps, p = 4.29 ps
• Measured on worst-case single-input switching

Fan-out   INV    NAND2  NAND3  NOR2   TGXORi  TGXORs  TGMUXi  TGMUXs  AOI    OAI
2         11.6   16.3   22.2   20.5   34.9    22.3    8.0     26.0    23.2   21.3
3         15.3   20.0   26.6   25.4   42.6    28.2    9.9     33.0    28.5   26.7
4         19.0   24.0   31.2   30.6   50.2    34.2    12.0    39.0    34.1   32.1
6         26.4   32.4   40.6   41.1   64.4    45.7    16.0    53.0    45.3   43.6
8         33.6   40.6   50.0   51.9   79.8    56.5    20.2    68.0    56.7   55.3
g (ps)    3.67   4.08   4.65   5.25   7.43    5.71    2.04    6.97    5.60   5.68
p (ps)    4.29   7.90   12.74  9.77   20.19   11.12   3.85    11.76   11.82  9.69
g (norm)  1.00   1.11   1.27   1.43   2.03    1.56    0.55    1.90    1.52   1.55
p (norm)  1.00   1.84   2.97   2.28   4.71    2.59    0.90    2.74    2.76   2.26
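The g and p rows of the static CMOS table can be recovered from the delay rows with a straight-line fit of delay against fan-out. The sketch below uses ordinary least squares on three of the table's columns; that the original characterization used exactly this fit is an assumption, though the results agree with the table to two decimals. The names are illustrative.

```python
FANOUT = [2, 3, 4, 6, 8]

# Delay columns (ps) copied from the static CMOS table above.
DELAY_PS = {
    "INV":   [11.6, 15.3, 19.0, 26.4, 33.6],
    "NAND2": [16.3, 20.0, 24.0, 32.4, 40.6],
    "NOR2":  [20.5, 25.4, 30.6, 41.1, 51.9],
}


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


def characterize(delays, reference="INV"):
    """Fit g (slope) and p (intercept) per gate and normalize to the reference gate."""
    g_ref, p_ref = fit_line(FANOUT, delays[reference])
    result = {}
    for gate, ys in delays.items():
        g, p = fit_line(FANOUT, ys)
        result[gate] = {"g_ps": g, "p_ps": p,
                        "g_norm": g / g_ref, "p_norm": p / p_ref}
    return result


if __name__ == "__main__":
    for gate, params in characterize(DELAY_PS).items():
        print(gate, {name: round(value, 2) for name, value in params.items()})
```

For the inverter the fit gives g ≈ 3.67 ps and p ≈ 4.29 ps, and the normalized NAND2 and NOR2 values match the table's g (norm) and p (norm) rows.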
Static CMOS Gates: Delay Graphs
[Figure: two plots of delay vs. fanout (0 to 9); first plot: INV, NAND2, NAND3, NOR2, AOI, OAI; second plot: INV, TGXORi, TGXORs, TGMUXi, TGMUXs.]

Static Gates: Pull-up Delay Graph
[Figure: pull-up delay vs. fanout (0 to 9) for INV, NAND2, NAND3, NOR2, AOI, OAI.]
LE Table (Dynamic CMOS)
• Technology:
• Minimum-sized keeper included
• Measured on all-input switching of worst path

Fan-out   DN2    DN3    DN4    Dk1ND2  Dk1NR2  DAOI_A  DOAI_O
2         9.9    12.7   16.0   13.7    10.6    10.1    8.8
3         12.6   14.7   19.1   16.7    13.2    12.1    11.3
4         16.0   18.3   23.2   20.7    16.7    14.7    14.0
6         21.7   24.7   30.2   27.9    23.2    20.0    19.2
8         27.3   31.2   37.8   36.1    29.5    24.8    24.0
g (ps)    2.92   3.15   3.65   3.75    3.19    2.49    2.55
p (ps)    4.04   5.82   8.46   5.76    3.95    4.86    3.75
g (norm)  0.80   0.86   1.00   1.02    0.87    0.68    0.69
p (norm)  0.94   1.36   1.97   1.34    0.92    1.13    0.87
Dynamic CMOS: Delay Graphs
[Figure: two delay plots (x-axis 0 to 10, y-axis 0 to 40); first plot: N2, N3, N4, k1ND2, k1NR2, AOI_A, OAI_O; second plot: G4, P4, C4, STBSum.]

Dynamic CMOS: Delay Graphs
[Figure: two delay plots (x-axis 0 to 10, y-axis 0 to 50); first plot: LG3, LP4, G4, P4, LC, Lsum; second plot: KSG4, KSP4, KSG16, KSP16, KSSum.]
Energy Calculation
[Figure: energy plots for an 8X minimal-size Dyn-NAND and a 16X minimal-size Dyn-NAND.]
Energy Calculation
Offset (parasitic + wiring energy) vs. Size (in multiples of the gate size)
[Figure: offset (0 to 60) vs. gate size x (0 to 45) with linear fits for inv, dgck, oai_o, daoi, tgxor, aoi_o, na2s, and tgmuxs. Fit equations shown on the plot: y = 0.8931x + 4.6411; y = 1.1413x + 10.22; y = 1.6382x + 11.988; y = 0.5538x + 12.338; y = 3.89x + 14.5; y = 1.9595x + 9.621; y = 1.2559x + 6.762; y = 1.0592x + 1.71.]
Energy Calculation
[Figure: 3-D plot of inverter energy [fJ] (0 to 140) vs. load [u] and size.]
Energy Calculation
INV
      Output Capacitance (u)                    Energy [fJ]
M     1      5      10     15     20            1         5         10        15        20
0     1.12   5.6    11.2   16.8   22.4          2.51E+00  1.26E+01  2.51E+01  3.77E+01  5.02E+01
1     2.24   11.2   22.4   33.6   44.8          3.70E+00  1.85E+01  3.70E+01  5.54E+01  7.39E+01
2     3.36   16.8   33.6   50.4   67.2          4.85E+00  2.42E+01  4.85E+01  7.27E+01  9.70E+01
3     4.48   22.4   44.8   67.2   89.6          6.16E+00  3.08E+01  6.16E+01  9.24E+01  1.23E+02
4     5.6    28     56     84     112           7.45E+00  3.73E+01  7.45E+01  1.12E+02  1.49E+02
5     6.72   33.6   67.2   100.8  134.4         8.74E+00  4.37E+01  8.74E+01  1.31E+02  1.75E+02
6     7.84   39.2   78.4   117.6  156.8         1.02E+01  5.08E+01  1.02E+02  1.52E+02  2.03E+02
7     8.96   44.8   89.6   134.4  179.2         1.15E+01  5.75E+01  1.15E+02  1.72E+02  2.30E+02
8     10.08  50.4   100.8  151.2  201.6         1.27E+01  6.36E+01  1.27E+02  1.91E+02  2.54E+02
9     11.2   56     112    168    224           1.42E+01  7.08E+01  1.42E+02  2.13E+02  2.83E+02
10    12.32  61.6   123.2  184.8  246.4         1.55E+01  7.76E+01  1.55E+02  2.33E+02  3.10E+02
11    13.44  67.2   134.4  201.6  268.8         1.69E+01  8.44E+01  1.69E+02  2.53E+02  3.37E+02
12    14.56  72.8   145.6  218.4  291.2         1.81E+01  9.05E+01  1.81E+02  2.71E+02  3.62E+02
13    15.68  78.4   156.8  235.2  313.6         1.97E+01  9.85E+01  1.97E+02  2.96E+02  3.94E+02
14    16.8   84     168    252    336           2.09E+01  1.04E+02  2.09E+02  3.13E+02  4.18E+02
15    17.92  89.6   179.2  268.8  358.4         2.26E+01  1.13E+02  2.26E+02  3.39E+02  4.52E+02
16    19.04  95.2   190.4  285.6  380.8         2.39E+01  1.20E+02  2.39E+02  3.59E+02  4.79E+02
17    20.16  100.8  201.6  302.4  403.2         2.53E+01  1.27E+02  2.53E+02  3.80E+02  5.06E+02
18    21.28  106.4  212.8  319.2  425.6         2.67E+01  1.34E+02  2.67E+02  4.01E+02  5.34E+02
19    22.4   112    224    336    448           2.81E+01  1.40E+02  2.81E+02  4.21E+02  5.61E+02

Multiplier Factor
Energy Factors: 1.211300121  7.39E-01
Output Capacitance Factor
NAND-2
Examples

64-Bit Adders
• Han-Carlson (prefix-2, HC2): Static and Dynamic
• Han-Carlson (prefix-2, HC2-2): Dynamic-Static
• Kogge-Stone (prefix-2, KS2): Static and Dynamic
• Kogge-Stone (prefix-2, KS2-2): Dynamic-Static
• Quaternary-Tree (prefix-2, QT2): Static and Dynamic
Included wire delay: $t_{delay} = 0.7 R_{wire} C_{wire}$
Included wire energy: $E_w = C_{wire} V^2$

Len (um)    10    20    30    40    60    80    120   160   240   320   480
Delay (ps)  0.01  0.04  0.09  0.17  0.38  0.67  1.50  2.67  6.01  10.7  24.1
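Because both R_wire and C_wire grow with length, the wire delay 0.7·R_wire·C_wire grows with the square of the length, and the length/delay table above follows that trend closely. The sketch below fits a single coefficient k in delay = k·L² to the table and evaluates the wire energy formula. The fitting approach and function names are illustrative assumptions, not the authors' procedure.

```python
# Wire length (um) and delay (ps) pairs copied from the table above.
LENGTH_UM = [10, 20, 30, 40, 60, 80, 120, 160, 240, 320, 480]
DELAY_PS = [0.01, 0.04, 0.09, 0.17, 0.38, 0.67, 1.50, 2.67, 6.01, 10.7, 24.1]


def fit_quadratic_coefficient(lengths, delays):
    """Least-squares k for the model delay = k * L^2 (no constant term)."""
    num = sum(d * l ** 2 for l, d in zip(lengths, delays))
    den = sum(l ** 4 for l in lengths)
    return num / den


def wire_delay_ps(length_um, k):
    """Wire delay for a given length under the fitted quadratic model."""
    return k * length_um ** 2


def wire_energy(c_wire, vdd):
    """E_w = C_wire * V^2 (units follow the inputs, e.g. fF and V give fJ)."""
    return c_wire * vdd ** 2


if __name__ == "__main__":
    k = fit_quadratic_coefficient(LENGTH_UM, DELAY_PS)
    print(f"k = {k:.3e} ps/um^2")
    for length, delay in zip(LENGTH_UM, DELAY_PS):
        print(f"{length:4d} um: table {delay:6.2f} ps, model {wire_delay_ps(length, k):6.2f} ps")
```

The fitted coefficient is roughly 1.05e-4 ps/um², and the model reproduces every table entry to within a few hundredths of a picosecond.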
Test Setup
[Figure: adder with inputs A0 to A63 and outputs S0 to S63; the outputs drive 1 mm wires modeled as Cwire.]
H = (Cin + Cwire)/Cin
Energy-Delay Estimates

Adders: Energy
Energy vs. Delay
Cout = 1mm wire (160u gate cap)
For Cin = ~minimum input to 50*minimum input
[Figure: energy [pJ] (0 to 900) vs. delay [ps] (0 to 300) for HC Dynamic (2-2), KS Dynamic (2-0), HC Dynamic (2-0), KS Dynamic (2-2), KS Static Prefix 2, HC Static Prefix 2, Quaternary Dynamic (2-2), and Quaternary Static; curve annotations: "Dynamic: KS, HC", "Static", "Dynamic-Static", "QT", "KS", "HC".]
Dynamic-Static Implementation of Carry-Merge Stage
[Figure: regular domino implementation vs. compound-domino implementation of the carry-merge stage (inputs Gi, Gi-1, Pi, Gi-2, Gi-3, Pi-2, Pi-1; clocks Clk and Delayed Clk); labels: "Static Gate", "inverters to be eliminated".]
Energy-Delay comparison of 64-bit KS, HC and QT adders
[Figure: normalized energy (0 to 3) vs. normalized delay (0.9 to 2.1) for QT Static, HC Static, KS Static, QT compound-domino, HC compound-domino, and KS compound-domino.]
Adders: Critical Path Energy
Critical Path Energy vs. Delay (no internal wire energy)
Cout = 1mm wire (160u gate cap)
For Cin = ~minimum input to 50*minimum input
[Figure: energy [fJ] (0 to 12000) vs. delay (0 to 300) for HC Dynamic (2-2), KS Dynamic (2-0), HC Dynamic (2-0), KS Dynamic (2-2), KS Static Prefix 2, HC Static Prefix 2, Quaternary (2-2), and Quaternary Static (2-2); curve annotations: QT dynamic-static, HC dynamic-static, QT static, KS dynamic-static, HC-dynamic, KS dynamic, HC-static, KS-static.]
Intel 32-bit Adder 0.13u 1.2V [VLSI-2002]
Comparison with Intel Measured Data
[Figure: energy [fJ] (0 to 50) vs. delay [ps] (0 to 200) for Kogge-Stone (2-0), Quaternary (2-2), Intel Kogge-Stone (2-0), and Intel Quaternary (2-2); curve annotations: QT, KS, KS estimated, QT estimated.]
Energy-Delay comparison of 32-bit QT and KS adders: estimated vs. simulation in 0.10 µm technology
[Figure: energy [pJ] (0 to 60) vs. delay [ps] (90 to 160) for KS [9], QT [9], KS Estimate, and QT Estimate; annotations: 55%, 35%.]
Est. Results: All Adders w/o Wires
[Figure: estimated energy (J), 0 to 1E-10, vs. delay (FO4), 7 to 15, for sKS, sHC, sQT9, dKS, dHC, dQT9, dQT7, dCLA, dIBM, and dLNG.]
Est. Results: All Adders w/ Wires
[Figure: estimated energy (J), 0 to 2.0E-10, vs. delay (FO4), 8 to 18, for sKS_LE, sHC_LE, sQT9_LE, dKS_LE, dHC_LE, dQT9_LE, dQT7_LE, dIBM_LE, and dLNG_LE.]
Energy-Delay Trade-offs
[Figure: energy [pJ] (0 to 80) vs. delay [ns] (0 to 4.0) for the initial design and the optimized design, marking the energy saving and the delay saving; worst-case energy vector with 100% input activity; 90nm technology.]
Collaboration with Intel AMR
Conclusion
• Using realistic measures for comparing various designs leads to better design choices
• Power is as important as speed
• Making comparisons in Energy-Delay space is necessary:
– power can always be traded for speed and vice versa
• Wire effects are significant
• Leakage currents?