METHODOLOGY Open Access An efficient procedure for protein ...
Methodology for EnergyMethodology for Energy-Efficient...
Transcript of Methodology for EnergyMethodology for Energy-Efficient...
Methodology for Energy-Efficient Design of Digital Circuits
Methodology for EnergyMethodology for Energy--Efficient Efficient Design of Digital CircuitsDesign of Digital Circuits
Vojin G. Oklobdzija,
Advanced Computer Systems Engineering LaboratoryUniversity of California / University of Texas
TxACE: Center of Excellence for Analog Circuitshttp://www. acsel-lab.com
IEEE SSCS, Distinguished Lecture Santa Clara Valley Chapter
April 15, 2010
*with acknolwedgment to contributions of my former students: Dr. Hoan Dao, AMD, Austin, TX and
Dr. Bart Zeydel, Plato Networks
April 19, 2010 Energy-Efficient CMOS Circuit Design2
Summary of the PresentationSummary of the Presentation
Energy Efficiency of Digital CMOS Circuits
o Problems
o Energy-Delay Relationship
o Minimizing Energy for a given delay
o Methodology
o Determining the best structures for high-performance system
o Implications on the architecture
April 19, 2010 Energy-Efficient CMOS Circuit Design3
Challenges in HighChallenges in High--Performance DesignPerformance DesignOptimizing for power – not speed ! (or maximizing speed under the power budget)
Logical Effort (LE) optimizes for speed, regardless of power i.e. brings us in the worse energy spot.
Our method optimizes for:powerpower @ given @ given speed speed or speedspeed @ given @ given power power We are developing new approaches for power efficiency (overlooked by delay optimization) applicable to:- Circuit structures- Design techniques- Energy-Delay Space- Creation of optimal Standard cell (ASIC) libraries
April 19, 2010
Motivation for Energy Efficient DesignMotivation for Energy Efficient Design
4 Energy-Efficient CMOS Circuit Design
Power density passed the level found in the Nuclear Reactor ! Power density degrades the reliability and speed.
Shekhar Borkar
April 19, 2010October 9, 2009 Energy-Efficient CMOS Circuit Design5
April 19, 2010 Energy-Efficient CMOS Circuit Design6
EnergyEnergy--Delay RelationshipDelay Relationship
April 19, 2010 Energy-Efficient CMOS Circuit Design7
EnergyEnergy--Delay Space ViewDelay Space View
Delay
Energy
A
B
Adder A
Adder B
Region 1 Region 2
A’ B’
A”
B”
Speed of A Speed of B
A isfaster
Lesspower
Point where B becomesbetter than A
With better E-Dtradeoff B canachieve more
speed with lesspower than A
• Must look at Energy-Delay Space of designs
April 19, 2010 Energy-Efficient CMOS Circuit Design8
0
1E-10
2E-10
3E-10
4E-10
5E-10
7 8 9 10 11 12 13Delay (FO4)
Est
imat
ed E
nerg
y (J
)cdKScdHCcdQT7cdIBMcdLingcdKS4-2cdQT9
EnergyEnergy--Delay Space ViewDelay Space View
High-PerformanceTarget
Low-PowerTarget
Which DesignShould be used?
April 19, 2010 Energy-Efficient CMOS Circuit Design9
• Begin to see characteristics of designs
0
1E-10
2E-10
3E-10
4E-10
5E-10
7 8 9 10 11 12 13Delay (FO4)
Est
imat
ed E
nerg
y (J
)cdKScdHCcdQT7cdIBMcdLingcdKS4-2cdQT9
Low-PowerTarget
High-PerformanceTarget
H is changing, w/ Cout=constant
EnergyEnergy--Delay Space ViewDelay Space View
April 19, 2010 Energy-Efficient CMOS Circuit Design10
0
1E-10
2E-10
3E-10
4E-10
5E-10
7 8 9 10 11 12 13Delay (FO4)
Est
imat
ed E
nerg
y (J
)cdKScdHCcdQT7cdIBMcdLingcdKS4-2cdQT9
Low-PowerTarget
High-PerformanceTarget
• Begin to see characteristics of designs
EnergyEnergy--Delay Space ViewDelay Space ViewH is changing, w/ Cout=constant
April 19, 2010 Energy-Efficient CMOS Circuit Design11
• Best High-Performance designs are clearly seen• Different than what would be chosen from single point
0
1E-10
2E-10
3E-10
4E-10
5E-10
7 8 9 10 11 12 13Delay (FO4)
Est
imat
ed E
nerg
y (J
)cdKScdHCcdQT7cdIBMcdLingcdKS4-2cdQT9
Low-PowerTarget
High-PerformanceTarget
EnergyEnergy--Delay Space ViewDelay Space ViewH is changing, w/ Cout=constant
April 19, 2010 Energy-Efficient CMOS Circuit Design12
0
1E-10
2E-10
3E-10
4E-10
5E-10
7 8 9 10 11 12 13Delay (FO4)
Est
imat
ed E
nerg
y (J
)cdKScdHCcdQT7cdIBMcdLingcdKS4-2cdQT9
Low-PowerTarget
High-PerformanceTarget
• Also determines best design for Low-Power Target
EnergyEnergy--Delay Space ViewDelay Space ViewH is changing, w/ Cout=constant
April 19, 2010 Energy-Efficient CMOS Circuit Design13
0
1E-10
2E-10
3E-10
4E-10
5E-10
7 8 9 10 11 12 13Delay (FO4)
Est
imat
ed E
nerg
y (J
)cdKS_LEcdHC_LEcdQT9_LEcdQT7_LEcdKS4-2_LEcdLing_LEcdIBM_LE
No Wire
• Without wire, differences appear large
Contribution of Wire to Delay and Energy Contribution of Wire to Delay and Energy should be examined tooshould be examined too
April 19, 2010 Energy-Efficient CMOS Circuit Design14
0
1E-10
2E-10
3E-10
4E-10
5E-10
7 8 9 10 11 12 13Delay (FO4)
Est
imat
ed E
nerg
y (J
)cdKScdHCcdQT7cdIBMcdLingcdKS4-2cdQT9
With Wire
• Wire strongly impacts selection of “best adders”
Contribution of Wire to Delay and Energy Contribution of Wire to Delay and Energy should be examined tooshould be examined too
April 19, 2010 Energy-Efficient CMOS Circuit Design15
EnergyEnergy
April 19, 2010 Energy-Efficient CMOS Circuit Design16
Where does Logical Effort lead us?Where does Logical Effort lead us?
Total Delay
Ener
gy
LE Point
lower stage-effort
higher stage-effort
• It is possible to lower energy by trading delay? or …
Most design approaches focus here
April 19, 2010 Energy-Efficient CMOS Circuit Design17
Design in EnergyDesign in Energy--Delay SpaceDelay Space
Delay
Ener
gy
Lowest possible energy
E0
Lowest possible dealy D0
(E-E0)(D-D0)=0.2xE0D0
All Possible Designs
Design ADesign B
*P. Penzes, Caltech, PhD 2002, V. Zyuban, ISLPED 2002
April 19, 2010 Energy-Efficient CMOS Circuit Design18
EnergyEnergy--Delay Space of a CircuitDelay Space of a Circuit(Fixed Input Size and Output Load)(Fixed Input Size and Output Load)
En
ergy
[pJ]
Delay [ps]
LE
Minimal Energy-Delay Curve
Cout
Cin
Fixed
Fixed
Variable130nm 1.2V CMOS
April 19, 2010 Energy-Efficient CMOS Circuit Design19
Exhaustive searchExhaustive search
──────────────────────────C. Giacomotto, N. Nedovic, and V. G. Oklobdzija, “The Effect of the System Specification on the Optimal Selection of Clocked Storage Elements”, IEEE JoSSC, vol. 42, no. 6, June 2007.
To obtain the minimalE-D curve do I have to check all possible solutions?
A circuit with 10 transistors and 10 possible size for each transistor requires to check 1010 possible solutions!
April 19, 2010 Energy-Efficient CMOS Circuit Design20
100
150
200
250
300
350
400
450
6 7 8 9 10 11Delay [FO4]
Ener
gy [f
J]
LE (Dmin)
EDPED2
ED3
ED0.5
Cout = 100Cin
Cin = Minimum size invED3
ED2
EDP
ED0.5
constant metric curve
H = Cout / Cin = 100
Points obtainedthrough gate sizing
Hardware Intensity, Victor Zyuban, IBM T.J.Watson
Hardware Intensity: V. Hardware Intensity: V. ZyubanZyuban, IBM, IBM
April 19, 2010 Energy-Efficient CMOS Circuit Design21
• Design choice depends on (E,D) requirements
DesignDesign ObjectiveObjective at at NomialNomial VVDDDD
0
10
20
30
40
50
14 16 18 20 22 24 26 28
Delay (FO4)
Ene
rgy
(pJ)
η=2.1
8-bit ACSFixed Input, Fixed Load(0.13μm, 1.2V CMOS)
Circuit Resizingat Fixed Supply
η=∞(Delay Opt.)
η=0.6
η=6.0
upplysfixedDE
ED
∂∂
−=ηhardware intensity:
Hardware Intensity, Victor Zyuban, IBM T.J.Watson
April 19, 2010
Prior Work on Prior Work on DesignDesign OptimizationOptimization
• Transistor-based [TILOS]– Sizing individual transistors– Growing complexity– Applicable to small blocks only
• Block-based [Zuyban & Strenski]– Blocks: latch & logic– Trading {energy,delay} of blocks– CAD tools– Fixed interface
April 19, 2010
TransistorTransistor--BasedBased ApproachApproach
• Optimization problem: (i=1..M)– Minimize: Area(W1,…,WM) ≈ ΣWi– Constraint: Dworst(W1,…,WM) = T
• Delay modeling– Linear (RC-like): TILOS– Look-up table: AMPS (Synopsys)
• A convex problem ⇒ minimal solution exists– Different polynomial algorithms developed– May have issues with convergence– Long run time with increasing design complexity
April 19, 2010
BlockBlock--BasedBased: : ZyubanZyuban (IBM)(IBM)
• Optimization problem for a pipelined stage:– Minimize energy: E(B1,…,BM) = ΣEBi
– Delay constraint: D(B1,…,BM) = ΣDBi = T
• Optimal criteria:– (wi / ui) · ηi = (wk / uk) · ηk for any (i,k)
with wx = EBx/E, ux = DBx/T, ηx = (∂EBx/EBx)/(∂DBx/DBx)
• Similar criteria for a simple pipelined system
[Zyuban, IBM T.J. Watson Research]block B1 block B2
block BM
April 19, 2010
Application: Application: SolutionSolution VerificationVerification
• Verify optimality of solution:– Block 1: (w1/u1)·η1 = 2.0– Block 2: (w2/u2)·η2 = 2.0
[Zyuban, IBM T.J. Watson Research]
w1=0.5u1=0.2
w2=0.5u2=0.8
Block 1Block 2
LoadInput
wi = Ei / E; E = ΣEiui = Di / D; D = ΣDi
η1=0.8 η2=3.2
Equal ⇒ optimal!
April 19, 2010
ApplicationApplication: : SolutionSolution VerificationVerification
• If η1 = 3.2 and η2 = 0.8– Block 1: (w1/u1)·η1 = 8.0– Block 2: (w2/u2)·η2 = 0.5– Better solution? Relax η1 and increase η2
w1=0.5u1=0.2
w2=0.5u2=0.8
Block 1Block 2
LoadInput
wi = Ei / E; E = ΣEiui = Di / D; D = ΣDi
Unequal ⇒ not optimal
April 19, 2010
MajorMajor LimitationLimitation
• Zyuban’s assumption:– Delay & energy independence of each block Bi
• Single path: block ≡ gate
• Similar dependency between blocks and pipelines• No analytical solution if accounting dependency
⎪⎪⎩
⎪⎪⎨
⎧∝∝
capgatenext:Ccapgatecurrent:C
}C,C{EEnergy}C,C{TDelay
out
in
inout
inoutd
: energy, delay dependencyof adjacent gates
B1 B2 B3 B4
Input Output
CLoad
April 19, 2010
Proposed StageProposed Stage--Based ApproachBased Approach
• Stage ≈ logic depth
• Gates → stage– Based on maximal distances to input and output– Stage delay: dstage = max{dgate}– Stage energy: Estage = ΣEgate– Estimated from gate energy & delay models
April 19, 2010
• Logical Effort delay model:
Delay and Energy Modeling of GatesDelay and Energy Modeling of Gates
( ) ( )τκκκ
κκ
pghCRCRCR
CC
CRCR
T invinvinvinv
pargate
gate
out
invinv
gategategated +=
⎥⎥⎦
⎤
⎢⎢⎣
⎡+= o,
logical effort g(gate dependent)
electrical effort h(load dependent)
parasitic p(gate dependent)
CoutCin =Cgate
Cpar
Rup=Rgate
Rdown=Rgate
stage effort f = g•h
reference delay τ(technology dependent)
inCKlinpoutggate WTPWEWEWE ++=)(
Leakage EnergyActive Energy
Energy model:
Pipelined Stage OptimizationPipelined Stage Optimization
April 19, 2010
StageStage--Based OptimizationBased Optimization
• Optimization functions– Delay: D = ΣDStage(i)– Energy: E = ΣEStage(i)
• Possible design constraints– Delay target, D– Input size, Winput– Output load, Cload
• Posynomial Problems – Solvable with polynomial algorithms
April 19, 2010
Problem A: Delay OptimizationProblem A: Delay Optimization
• Optimization problem– Minimize: D = ΣDSi– Constraint: {Input, Load} = const.
• Objectives– Obtain minimally achievable delay, Dmin– Wanted in performance-critical designs– Disregard energy consumption (actually, ∂Ei /∂Di = ∞)
April 19, 2010
Single Path: Logical EffortSingle Path: Logical Effort
• Solution = equal stage effort f (i.e. fan-out)
N/
in
LoadN
iiopti C
Cgff1
1⎥⎦
⎤⎢⎣
⎡⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛== ∏
=
∑=
+⎟⎟⎠
⎞⎜⎜⎝
⎛+=
N
ii
i
ii p
CCgD
1
1Ci = input cap of gate iCin = C1 : fixedCN+1 = CLoad : fixed
∑=
+=N
iioptmin pfND
1
dD/dCi = 0 (∀i = 1..N):
Path Gain, H
April 19, 2010
Energy Cost vs. Total DelayEnergy Cost vs. Total Delay
S1 S2 S3 S4
InputOutput
CLoad= 92CCin
Cin=C1.71C2.44C3.16C
4.16C5.6C
7.7C10.9C
16C
0
200
400
600
800
1000
1200
1400
14 16 18 20 22 24Delay (tau)
Tota
l Ene
rgy
(fJ)
Fixed Load = 92C
better performance ⇔larger input, more energy
April 19, 2010
MultiMulti--Path CircuitsPath Circuits
• Optimal delay depends on off-path load
∑
∑
=
+
+
++
=
++
⎟⎟⎠
⎞⎜⎜⎝
⎛+
+=
⎟⎟⎠
⎞⎜⎜⎝
⎛+
+=
N
ii
i
i
i
i,offii
N
ii
i
i,offii
pC
CC
CCg
pCCC
gD
1
1
1
11
1
11
Ci = input cap of gate iCoff,i = input cap of i th off-path gate
C1 C2 C3 C4
Input Output
CLoadCin
Coff,2 Coff,3 Coff,4
bi+1 = branching factor at (i+1)th-gate input
on-path
off-paths
April 19, 2010
Linear BranchingLinear Branching
• Linear branching: Coff,i / Ci = const.
• Similar analytical form for solution
N
in
LoadN
ii
N
iiopti C
Cbgff/1
11⎥⎦
⎤⎢⎣
⎡⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛== ∏∏
==
dD/dCi = 0 (∀i = 1..N):
∑=
+=N
iioptmin pfND
1
C1 C2 C3 C4
Input Output
CLoadCin
Coff,2 Coff,3 Coff,4
Branching Factor, B
April 19, 2010
NonNon--Linear BranchingLinear Branching
• Nonlinearity due to:– Constant off-path load (wire cap, min-size gates)– Unequal path lengths– Parasitic delay difference of gates
• No analytical form– Recursive solving– Solution: unequal stage efforts
C1 C2 C3 C4
Input Output
CLoadCin
Coff,2 Coff,3 Coff,4
April 19, 2010
6464--bit Static bit Static KoggeKogge--Stone AdderStone Adder
long wiringwiring
intermediate wiringintermediate wiring
intermediate wiringintermediate wiring
short wiringshort wiring
short wiringshort wiring
short wiringshort wiring
8 S
tage
s
g = aibi ; p = ai+bi
G = gi+pigi-1 ; P = pipi-1
Sum = ai ⊕ bi ⊕ ci
April 19, 2010
0
1
2
3
4
5
6St
age
Effo
rt, f
Stage1 Stage2 Stage3 Stage4 Stage5 Stage6 Stage7 Stage8
Wire Ignored
Wire Included
6464--b KS: Stage Effort Distributionb KS: Stage Effort Distribution
• Nonlinearity causes unequal stage efforts– Nonlinear factors: wire load, parasitic delay diff.
Input = 3μmLoad = 60μm
wireeffect
parasiticdifference
April 19, 2010
6464--b KS: Energy versus Delayb KS: Energy versus Delay
0
50
100
150
200
250
300
11 12 13 14 15 16Delay (FO4)
Ene
rgy
(pJ)
Wire Ignored
Wire Included
wireeffect
Input = {30, 20, 15, 10, 6, 3}μm
exponentialbehavior
• Significant wire effect on delay & energy
April 19, 2010
Problem B: EnergyProblem B: Energy--Delay TradeoffDelay Tradeoff
• Optimization Problem– Minimize: E = ΣEStage(i)– Constraint: {Input, Load} = const.
ΣDStage(i) = Dmin + ΔD• Objectives
– Avoid infinite energy sensitivity at Dmin– Equalize energy-delay sens.: (∂E/∂D)Si = (∂E/∂D)Sk– Trade delay for energy (traditional approach)
April 19, 2010
EE--D TradeD Trade--off: Single Pathoff: Single Path
S1 S2 S3 S4
InputOutput
92CinCin=1
0
100
200
300
400
500
600
20 25 30 35Delay (tau)
Ener
gy (f
J)
D_min
η = 1.0 @D = 1.18Dmin
gooddelay trading
for energy less energy savingfor extra traded delay
(infinite energy sensitivity)
April 19, 2010
0
2
4
6
8
10
12
14
16
18
Stag
e Ef
fort
, f
f_Stage1 f_Stage2 f_Stage3 f_Stage4
EE--D TradeD Trade--off: Stage Effort off: Stage Effort DistribDistrib..
logical effort
S1 S2 S3 S4
InputOutput
92CinCin=1
unequal distribution of ΔD to stages
ΔD: small→large
April 19, 2010
0
50
100
150
200
250
300
350
400
Ener
gy (f
J)
E_Stage1 E_Stage2 E_Stage3 E_Stage4
EE--D TradeD Trade--off: Energy off: Energy DistribDistrib..
S1 S2 S3 S4
InputOutput
92CinCin=1
logical effort
minimal overallenergy reduction
(*) Including constant output load
ΔD: small→large Constant Constant LoadLoad
EnergyEnergy
April 19, 2010
6464--b KS: Energy Delay Tradeoffb KS: Energy Delay Tradeoff
• Dmin solution is very inefficient in energy• 55% energy saving with 10% delay traded• Solution = equal stage energy-delay sensitivity
0
100
200
300
400
11 12 13 14 15 16Delay (FO4)
Ene
rgy
(pJ) Fixed Input & LoadPossible Design
Dmin
EfficientE-D Curve
55%
Ener
gy
10%delay
April 19, 2010 Energy-Efficient CMOS Circuit Design46
Design in EnergyDesign in Energy--Delay SpaceDelay Space
LE
Delay
Ener
gy
Decrease Stage Effort
Increase Stage Effort
If this is your design point – the drop is steep !But you should not be
designing there!!
For a small sacrifice in delaythe energy savings are big !
From:R.W. Brodersen, M.A. Horowitz, D. Markovic, B. Nikolic, V. Stojanovic, "Methods for True Power Minimization," International Conference on Computer-Aided Design, ICCAD-2002, Digest of Technical Papers, San Jose, CA, November 10-14, 2002, pp. 35-42.
April 19, 2010
Problem C: Energy MinimizationProblem C: Energy Minimization
• Problem:– Minimize: E = ΣESi
– Constraint: D = ΣDSi = const.Load = const.
• Objective:– Obtain absolute minimal energy @ given delay– Equalize energy-delay sens.: (∂E/∂D)Si = (∂E/∂D)Sk
– Trade input size such that (∂E /∂Input)D = 0
April 19, 2010
Single Path: Stage Effort Redistrib.
S1 S2 S3 S4
InputOutput
92CCin
0123456
789
Stag
e Ef
fort
, f
f_stage1 f_stage2 f_stage3 f_stage4
Logical Effort
Min-Energy
Σf = constredistribution ofstage efforts tofit stage energy
April 19, 2010
Single Path: Energy DistributionSingle Path: Energy Distribution
0.75E* @4.9x Input
(*) Including energy consumed on output load
0
50
100
150
200
250
300
350
400
Ener
gy (f
J)
E_stage1 E_stage2 E_stage3 E_stage4
Logical Effort
Min-Energy
• 25% less energy• Energy reduction in heavily-loaded stages
S1 S2 S3 S4
InputOutput
92CCin
Constant Constant LoadLoad
EnergyEnergy
April 19, 2010
6464--b KS: Minimal Energy vs. Delayb KS: Minimal Energy vs. Delay
• 30 - 50% energy saving @ same performance• 1.6 - 3.6X input size
0
50
100
150
200
250
300
11 12 13 14 15 16 17Delay (FO4)
Ene
rgy
(pJ)
Delay Optimized
Energy Minimized
W LOAD = 60 μ m
38%Energy
Pipelined System OptimizationPipelined System Optimization
April 19, 2010
Pipelined System OptimizationPipelined System Optimization
• Design constraints– Delay target– External I/O constraint
• How to obtain minimal-energy solution?– Pipelined stages
• Minimized for energy• Sensitive to input and load variations
– System level• Balancing energy sensitivities at pipelined boundaries
– Recursive process
April 19, 2010
Pipelined Stage: Efficient EPipelined Stage: Efficient E--D AreaD Area
0
50
100
150
200
250
300
11 12 13 14 15 16 17Delay (FO4)
Ener
gy (p
J)
Energy Rangefor Designs
Achieving the Same Delay
Energy-Minimized Design Points(Lowest Energy Achievable)
30μminput
Smaller H = Cout/Cin
Load = 60 μ m gate width
(64-bit static Kogge-Stone adder in 0.13μm CMOS at 1.2V)
20μminput
10μminput
6μminput 4μm
input 3μminput
Delay-OptimizedDesign Points(Logical Effort)
• Efficient E-D area is upper- and lower-bounded
April 19, 2010
0
10
20
30
40
50
60
11 12 13 14 15 16 17
Delay (FO4)
Inpu
t Siz
e ( μ
m)
Delay-OptimizedDesign Points(Logical Effort)
Energy-Minimized Design Points
Range of Input Sizefor Achieving the
Same Delay
(64-bit static Kogge-Stone adder in 0.13μm CMOS at 1.2V)Load = 60 μ m gate width
Efficient InputEfficient Input--Delay AreaDelay Area
• Larger/smaller input less/more energy
April 19, 2010
Efficient EnergyEfficient Energy--Input @ D = const.Input @ D = const.
• Energy vs. input size: one-to-one relation– Energy sensitivity to input, σE,Input
• Flat energy in upper half of input range
Input Size
Delay-Opt
Ene
rgy
Energy-Min.
Tradable InputTr
adab
le E
nerg
y
Delay = constLoad = const
fixedLoadInputE Input
E
=∂∂
−=,σ
Delay
Energy
Delay
Input
April 19, 2010
0
50
100
150
200
250
300
10 15 20 2530 35
40
60
80
100Ener
gy (p
J)
Input (μm)
Load
(μm
)
• Energy sens. to output load, σE,Load– More @ upper bound (delay-opt.)– Less @ lower bound (energy-min.)
• Energy components = energy + its sensitivities
Sensitivity to Load @ D = const.
Load
Ene
rgy
(Fixed delay, fixed input)
Almost Linear
Exponential
Lower Boundary
Uppe
r Bou
ndar
y
LoadE ,σInputE ,σfixedInput
Load,E LoadE
=∂∂
=σ
Delay = const
April 19, 2010
System Energy Optimization
• Energy minimization– Pipeline: minimize energy @ given input & load– System: balance energy sensitivities @ boundaries
• Trading elements: input size, output load
• Optimal criteria:⎪⎩
⎪⎨⎧
==
−1ii
i
ALoadEAInputE
AStage nimalmiE
,, σσ
Pipeline 1E1, Do
LoadInput
Boundary B2 Boundary BN
Logic
Pipeline 2E2, Do
Logic
Pipeline NEN, Do
Logic
April 19, 2010
How to Achieve Less Total EnergyHow to Achieve Less Total Energy
Case A: σE,Input (i) > σE,Load (i-1)
• Increase i th input⇒ less ETotal
)i(Load,E)i(Input,Eii nmiEE 11 −− =⇔=+ σσ
(i -1)th Pipeline:{E, D, σE,Input, σE,Load}
Boundary Bi
i th Pipeline:{E, D, σE,Input, σE,Load}
Logic Logic
Case B:σE,Input (i) < σE,Load (i-1)Reduce i th input⇒ less ETotal
April 19, 2010
Case Study: Media Case Study: Media DatapathDatapath
• What is the minimal energy solution?
Maximal Input = 30μm
Target Delay = 17FO4
Output Load = 60μm gate width
April 19, 2010
Case Study: Optimal CriteriaCase Study: Optimal Criteria
Simplified Datapath
MAX{X,Y,U,V} = 30μm
Delay = 17FO4
{Z} = 60μm
σE,Load P1 + σE,Load P2= σE,Input P3
E{P1} = min.
E{P2} = min.
E{P3} = min.
Boun
dary
April 19, 2010
Case Study: Optimal AlgorithmCase Study: Optimal Algorithm
Simplified Datapath
April 19, 2010
Case Study: Media Datapath Solution
0
200
400
600
800
1000
1200
4 6 8 10 12Multiplier Load = Adder Input (um)
Ene
rgy
(pJ)
0
50
100
150
200
250
300
Ene
rgy
Sen
sitiv
ity (p
J/m
)
Optimal
37%26%System Energy
EnergySensitivity
M ultipliers
M ediaAdder
BalancedSensitivity
0
50
100
150
200
250
300
3 5 7 9 11 13Multiplier Output Load = Adder Input Size (μm)
Ener
gy S
ensi
tivity
(pJ/
μm)
0
200
400
600
800
1000
1200
Syst
em E
nerg
y (p
J)
Optimal
26%
ene
rgy
savi
ng fr
omde
lay-
optim
ized
add
er
EnergySensitivity
BalancedSensitivities
MediaAdder Multipliers
Region of Design Points
Minimal Energyat Fixed Boundaries
37%
ene
rgy
savi
ng fr
omde
lay-
optim
ized
mul
tiple
rs
SystemEnergy
Media Datapath0.13μm CMOS at 1.2V
Delay = 17FO4
April 19, 2010
System EnergySystem Energy--Delay @ VDelay @ VDDDD=const.=const.
• Similar E-D characteristics as single stages• Possibly less input size @ lower delay
VDD=constant0
100
200
300
400
500
600
700
14 16 18 20 22 24 26
Delay (FO4)
Tota
l Sys
tem
Ene
rgy
(pJ)
η = θ =2.1
MinimalAchievable
E-D forCircuit Sizing
{10,5} μ m
{5,1} μ m
input{Mult,Add} = {3.8,0.8} μ m
{13.7,1.4} μ m
{20,10} μ m{25.0,5.0} μ m
{22.1,3.0} μ m
{15,15} μ m
Zyuban & Strenski'sFixed Pipeline Input
OptimizationRegion of
Design Points
Media Datapath0.13μm CMOS at 1.2V
April 19, 2010
Effect of Supply ScalingEffect of Supply Scaling
• Efficient {E, D, VDD} are dependent• Optimal criterion: θη =
∂∂
=∂∂
systemplyupssizing
orDE
ED
DE
ED
0
50
100
150
200
250
300
350
400
12 17 22 27 32
Delay (FO4)
Tota
l Ene
rgy
(pJ)
0.8V
Optimal (E,D)at 1.2V
Optimal (E,D)at 1.0V
Minimal AchievableE-D for Circuit Sizingand Supply Variation
0.9V1.0V
1.1V
1.2V
1.3V1.4V
1.5V Minimal E-Dfor circuit sizing
at 1.0V supply voltage
Region of Design Points
Media Datapath(0.13μm, 1.2V CMOS)
April 19, 2010
0
100
200
300
400
500
1.0 1.1 1.2 1.3 1.4 1.5 1.6Supply Voltage (V)
Ener
gy (p
J)Media Datapath
900psDelay
NominalSupply
794psDelay2.8X
1.4X
Potential Saving of System Energy Potential Saving of System Energy
• Significant energy saving with correct supply or delay selection
Energy @nominal VDD
Energy @optimal supply
(1.2V, 0.13μm CMOS)
EnergyEnergy--Delay ImprovementDelay Improvementof Pipelined Stagesof Pipelined Stages
April 19, 2010
Architectural AdvantagesArchitectural Advantages
• Feedback bus is typical
• Length = 128μm
≅ N • area≅ N • bus length1/N clock rate
≅ area≅ bus lengthSame clock rate
April 19, 2010
EnergyEnergy--Throughput ComparisonThroughput Comparison
• Pipelining is mostly more efficient in E-D domain!
2x3x
4x5x
Ref
2x 3x 4x 5x
0
2
4
6
8
10
12
14
16
18
20
0 1 2 3 4 5 6 7Throughput (norm)
Ener
gy (p
J/op
)Parallelism
Pipelining
(0.13 μ m, 1.2V CMOS)
More EnergyEfficient
BetterThroughput
ACS Circuit
April 19, 2010
ACS Unit ImplementationNot pipelined8 ACS circuitsLargest areaLeast PM entries per ACS
Pipelined-42 ACS circuits
Small area
Most PM entries / ACS
Pipelined-24 ACS circuitsMidway areaMore PM entries / ACS
Application:64-state rate-½Viterbi decoder
April 19, 2010
Energy-Throughput Comparison
• Less energy for deeper pipeline at given throughput
Pipelining = 1Parallelism = 8
Pipelining = 2Parallelism = 4
Pipelining = 4Parallelism = 2
0
50
100
150
200
15 20 25 30 35 40Clock Cycle (FO4)
Tota
l AC
S En
ergy
(pJ)
Supply ScalingCircuit Sizing
63%
37%Lower Supply Voltage
(0.04V Step)
ACS Unit(0.13μm, 1.2V CMOS)
April 19, 2010 Energy-Efficient CMOS Circuit Design71
Designing a System Designing a System for a Fixed Performance for a Fixed Performance
in the Energyin the Energy--Delay SpaceDelay Space
April 19, 2010 Energy-Efficient CMOS Circuit Design72
Pipelining and Parallelism for an ACS CircuitPipelining and Parallelism for an ACS Circuit
Ideal Degree-N Parallelism•N-time throughput•(Area, routing) overhead
Ideal Depth-N Pipelining•N-time throughput•CSE overhead
April 19, 2010 Energy-Efficient CMOS Circuit Design73
EnergyEnergy--Delay Results for ParallelismDelay Results for Parallelism
1X2X
3X4X
5X6X
7X8X
9X10X
0
2
4
6
8
10
12
14
16
18
20
0 5 10 15 20 25Delay [FO4]
Ene
rgy
[pJ/
op]
H. Q. Dao, B. R. Zeydel, V. G. Oklobdzija, "Energy-Efficient Optimization of the Viterbi ACS Unit Architecture", Proceedings of the Asian Solid-State Circuit Conference, A-SSCC 2005, Hsinchu, Taiwan, November 1-3, 2005.
Nominal Supply (1.2V)Varying Supply
DecreasingSupply Voltage
52%EnergySavings
HigherParallelism Degree
1.5V
1.3V
1.2V
1.4V
1.14V
1.08V1.02V
0.96V0.90V
0.84V0.78V
0.72V
Gate Sizes are optimized for Nominal Supply (1.2V)
April 19, 2010 Energy-Efficient CMOS Circuit Design74
5X4X 3X 2X 1X
0
2
4
6
8
10
12
14
0 5 10 15 20 25
Delay [FO4]
Ener
gy [p
J/op
]
H. Q. Dao, B. R. Zeydel, V. G. Oklobdzija, "Energy-Efficient Optimization of the Viterbi ACS Unit Architecture", Proceedings of the Asian Solid-State Circuit Conference, A-SSCC 2005, Hsinchu, Taiwan, November 1-3, 2005.
61%EnergySavings
HigherPipeline Depth
DecreasingSupply Voltage
Nominal Supply (1.2V)Varying Supply
1.5V
1.3V
1.2V
1.4V
1.14V1.08V
1.02V
0.96V0.90V
0.84V
0.78V 0.72V
Gate Sizes are optimized for Nominal Supply (1.2V)
EnergyEnergy--Delay Results for PipeliningDelay Results for Pipelining
April 19, 2010 Energy-Efficient CMOS Circuit Design75
EnergyEnergy--Delay Estimation Matches Complex Delay Estimation Matches Complex Circuit SimulationCircuit Simulation
0.0001
0.001
0.01
0.1
1
10
0.5 1.0 1.5 2.0 2.5 3.0Delay (normalized to 1.2V data)
Ener
gy (n
orm
aliz
ed to
1.2
V da
ta)
Gates
Adder 1
Adder 2
Multiplier 1
Multiplier 2
Lower Supply(0.1V step)
Act
ive
Leak
age
1.5V1.2V
0.7V
April 19, 2010 Energy-Efficient CMOS Circuit Design76
Energy Efficiency of Architectural ChoicesEnergy Efficiency of Architectural Choices(including supply scaling and circuit sizing)(including supply scaling and circuit sizing)
2x3x
4x5x
1X2x3x
4x5x
0
2
4
6
8
10
12
14
16
18
20
0 5 10 15 20 25
Delay [FO4]
Ene
rgy
[pJ/
op]
Parallelism
Pipelining
> 52%
Supply ScalingCircuit Sizing
MoreEnergyEfficient
SlightlyBetterDelay
ACS Circuit
H. Q. Dao, B. R. Zeydel, V. G. Oklobdzija, "Energy-Efficient Optimization of the Viterbi ACS Unit Architecture", Proceedings of the Asian Solid-State Circuit Conference, A-SSCC 2005, Hsinchu, Taiwan, November 1-3, 2005.
April 19, 2010 Energy-Efficient CMOS Circuit Design77
EergyEergy--Delay Results for Parallelism and PipeliningDelay Results for Parallelism and Pipelining
Reference
Parallel-8
Parallel-4
Parallel-2
Pipeline-2Pipeline-4
Parallel-2Pipeline-2
Parallel-2Pipeline-4
Parallel-4Pipeline-2
0
2
4
6
8
10
12
14
16
18
0 5 10 15 20 25
Delay [FO4]
AC
S E
nerg
y [p
J/op
]depth•degree = 8
depth•degree = 4
Pipelining has the Best EnergyPipelining has the Best Energy--Delay TradeoffDelay Tradeoff
Supply = 1.2V
H. Q. Dao, B. R. Zeydel, V. G. Oklobdzija, "Energy-Efficient Optimization of the Viterbi ACS Unit Architecture", Proceedings of the Asian Solid-State Circuit Conference, A-SSCC 2005, Hsinchu, Taiwan, November 1-3, 2005.
April 19, 2010 Energy-Efficient CMOS Circuit Design78
Delay
Ener
gy
DelayOptim ized
Energy Optim ized
Is it possible to lower the Energy ?Is it possible to lower the Energy ?
• Reduce Energy for same Delay! • Improve Delay for same Energy!
EnergySavings
April 19, 2010 Energy-Efficient CMOS Circuit Design79
0
50
100
150
200
250
300
350
10 11 12 13 14Delay (FO4)
Ener
gy (p
J)
KS_LE HC_LE
KS_Opt HC_Opt
Achieved EnergyAchieved Energy Savings in KS and HC Savings in KS and HC AddersAdders
• Simulation of 64-bit static adders confirms saving!
50% 45%
April 19, 2010 Energy-Efficient CMOS Circuit Design80
0E+0
01E
-10
2E-1
03E
-10
4E-1
05E
-10
7 8 9 10 11 12 13Delay (FO4)
Estim
ated
Ene
rgy
(J)
cdKS cdKS_LEcdHC cdHC_LEcdQT7 cdQT7_LEcdIBM cdIBM_LEcdLing cdLing_LEcdKS4-2 cdKS4-2_LE
27%
18%52%
21%
26%24%
Achieved EnergyAchieved Energy Savings in Savings in Representative AddersRepresentative Adders
• Comparison of high-performance 64-bit adders- Delay and Energy Optimized
April 19, 2010 Energy-Efficient CMOS Circuit Design81
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0S1S2S3S4S5S6S7S8S9S10S11
0
5
10
15
20
25
30
35
40
45
50
55
60
Gat
e Si
ze (u
m)
Bit
Stag
e15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
S1S2S3S4S5S6S7S8S9S10S11
0
5
10
15
20
25
30
35
40
45
50
55
60
Gat
e Si
ze (u
m)
Bit
Stag
e
• Energy minimization improves hotspots!
Reduction of HotReduction of Hot--SpotsSpots
Reduces Hotspots
Delay Optimized Energy Optimized
April 19, 2010 Energy-Efficient CMOS Circuit Design82
Delay [ns]0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0
Ene
rgy
[pJ]
0
10
20
30
40
50
60
70
80
Accomplishments Accomplishments (June 30, 2003 to June 30, 2004)(June 30, 2003 to June 30, 2004)
Initial Design
Optimized Design Worst Case Energy VectorWith 100% Input Activity
EnergySavingDelay
Saving
90nm technology
Collaboration with
Intel AMR
215μm
Input Flip FlopsInput Flip FlopsInput Flip FlopsBooth EncodingBooth EncodingBooth Encoding
Booth SelectBooth SelectBooth Select
Partial Product TreePartial Product TreePartial Product Tree
Final AdderFinal AdderFinal AdderOutput Flip FlopsOutput Flip FlopsOutput Flip Flops
130μm
April 19, 2010 Energy-Efficient CMOS Circuit Design83
SummarySummary• Energy-Efficient Design requires:
– Early structure comparison in energy-delay space– Early Layout/Floorplanning– Optimization using energy minimization objective
function
• LE does not guarantee a good design
• Our method of energy-minimization focuses on reducing power
• The same principles hold for other logic functions
April 19, 2010
Future Work
• Methodology improvement– General design rules– Algorithms with fast convergence– Guidelines for close-to-optimal solutions
• CAD tool– Custom vs. standard cell library
• Improve gate modeling & characterization– Worst-case vs. single-switching,… or between?– Process variation
April 19, 2010
Publications on Energy-Delay:• Vojin G. Oklobdzija, Bart R. Zeydel, Hoang Dao, Sanu Mathew, Ram Krishnamurthy,
“Energy-Delay Estimation Technique for High-Performance Microprocessor VLSI Adders”, Proceedings of the International Symposium on Computer Arithmetic, ARITH-16, Santiago de Compostela, SPAIN, June 15-18, 2003.
• Hoang Q. Dao, Bart R. Zeydel, Vojin G. Oklobdzija, “Energy Minimization Method for Optimal Energy-Delay Extraction”, Proceedings of the European Solid-State Circuits Conference, ESSCIRC 2003, Estoril, PORTUGAL, September 16-18, 2003.
• V. G. Oklobdzija, B. R. Zeydel, H. Q. Dao, S. Mathew, R. Krishnamurthy, "Comparison of High-Performance VLSI Adders in Energy-Delay Space", IEEE Transaction on VLSI Systems, Volume 13, Issue 6, pp. 754-758, June 2005.
• Hoang Q. Dao, Bart R. Zeydel, Victor Zyuban, and Vojin G. Oklobdzija, “On Energy Optimization of Digital Systems”, The fourth annual IBM Austin Conference on Energy-Efficient Design, ACEED 2005, Austin, Texas, March 1-3, 2005.
• H. Q. Dao, B. R. Zeydel, V. G. Oklobdzija, "Energy-Efficient Optimization of the ViterbiACS Unit Architecture", Proceedings of the Asian Solid-State Circuit Conference, A-SSCC 2005, Hsinchu, Taiwan, November 1-3, 2005.
• H. Q. Dao, B. R. Zeydel, V. G. Oklobdzija, "Energy Optimization of Pipelined Digital Systems Using Circuit Sizing and Supply Scaling", IEEE Transaction on VLSI Systems,Vol. 14, Issue 2, Feb. 2006 pp. 122-134.
• S. K. Hsu, S. K. Mathew, M. A. Anders, B. R. Zeydel, V. G. Oklobdzija, R. K. Krishnamurthy, S. Y. Borkar, “A 110 GOPS/W 16-bit Multiplier and Reconfigurable PLA Loop in 90-nm CMOS”, IEEE Journal of Solid-State Circuits, Vol.41, No.1, January 2006.
• B. R. Zeydel, D. Baran, V. G. Oklobdzija, “Energy Efficient Design of High-Performance VLSI Adders”, IEEE Journal of Solid-State Circuits, June 2010.