EE241 - Spring 2005bwrcs.eecs.berkeley.edu/Classes/icdesign/ee241_s05/... · 2005. 3. 1. · •...
1
EE241 - Spring 2005
Advanced Digital Integrated Circuits
Lecture 11: Voltage Optimization
2
Power/Energy Optimization Space
Regimes: Constant Throughput/Latency, Variable Throughput/Latency

Energy    | Design Time                                            | Sleep Mode                                         | Run Time
Active    | Logic design, scaled VDD, transistor sizing, multi-VDD | Clock gating                                       | DFS, DVS
Leakage   | Stack effects, + multi-VT                              | Sleep T’s, multi-VDD, variable VT, + input control | + Variable VT
3
Design Time Optimization of Active Power
♦ Design Parameters
• Circuit (sizing, supply, threshold)
• Logic style (domino, pass-gate, …)
• Block topology (adder: CLA, CSA, …)
• Micro-architecture (parallel, pipelined)
[Figure: energy/op versus delay trade-off curves for topology A and topology B]
Source: B. Nikolic
4
Sizing, Supply, Threshold Optimization
Transistor sizing can yield large power savings with small delay penalties
Gate sizing
Beta-ratio adjustments
Stack resizing
IBM EinsTuner
Supply voltage affects both active and leakage energy
Threshold voltage affects primarily the leakage
5
Sizing, Supply, Threshold Optimization
There exists optimal supply + threshold for each function
At this optimum, ESw/ELk ≈ 2 (switching energy is about twice the leakage energy)
Depends on logic depth, activity, function
Technology is not optimal for all blocks
Adjust during the design: multiple supplies, thresholds
Variable throughput applications: variable supplies, thresholds
6
Multi-dimensional search
[Figure: energy (normalized to Enom) versus delay (normalized to Dnom)]
• Well-defined optimization problem
• Can get pretty close to optimum with only 2 variables
• Getting the minimum delay (maximum speed) is very expensive
7
Example: W-VDD Optimization for LA Adder
♦ Reference: all paths are critical
♦ Internal energy ⇒ W more effective than Vdd
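• Sizing (W): E (-54%); two supplies (2Vdd): E (-27%); both at delay increment dinc = 10%
[Figure: energy versus delay, showing the nominal design at D = Dmin and the sizing and 2Vdd points at dinc = 10%]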
8
Multiple Supply Voltages
Block-level supply assignment
• Higher throughput/lower latency functions are implemented in higher VDD
• Slower functions are implemented with lower VDD
• “Voltage islands”, as called by IBM
• Separate supply grids, level conversion performed at block boundaries
Multiple supplies inside a block
• Non-critical paths moved to lower supply voltage
• Level conversion within the block
• Physical design challenging
9
Multiple Supplies in a Block
[Figure: conventional design versus CVS structure. Both show flip-flop (FF) bounded logic with the critical path marked; in the CVS structure the lower VDD portion is shaded and level-shifting F/Fs are used.]
M.Takahashi, ISSCC’98. “Clustered voltage scaling”
10
Pulsed LCFF
M/S and pulsed half-latch LCFFs (MSHL, PHL)
• Smaller # of MOSFETs / clock loading
• Faster level conversion using half-latch structure
• Shorter D-Q path from pulsed circuit
[Figure: MSHL and PHL schematics, each with its level-conversion stage marked]
Ishihara, ISLPED’03
11
Pulsed LCFF
Pulsed precharge LCFF (PPR)
• Fast level conversion by precharge structure
• Suppressed charge/discharge toggle by conditional capture
• Short D-Q path
Ishihara, ISLPED’03
12
Multiple Supply Voltages
Two supply voltages per block are optimal
Optimal ratio between the supply voltages is 0.7
Level conversion is performed on the voltage boundary, using a level-converting flip-flop (LCFF)
An option is to use an asynchronous level converter
More sensitive to coupling and supply noise
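To get a feel for what a supply ratio of 0.7 buys, the sketch below estimates dynamic power when a fraction of the switched capacitance is moved to the lower supply. It assumes P ∝ C·V²·f at a fixed clock and ignores level-converter overhead and leakage; the function name and the capacitance splits are illustrative, not taken from the lecture.

```python
def dual_vdd_power_ratio(frac_low, ratio=0.7):
    """Dynamic power of a dual-supply block relative to single supply.

    frac_low: fraction of switched capacitance moved to the lower supply.
    ratio:    V2/V1 (the slide quotes an optimum ratio of 0.7).
    Assumes P ~ C * V^2 * f at a fixed clock; level-converter and
    leakage power are ignored (assumption of this sketch).
    """
    if not 0.0 <= frac_low <= 1.0:
        raise ValueError("frac_low must be in [0, 1]")
    return (1.0 - frac_low) + frac_low * ratio ** 2


if __name__ == "__main__":
    for frac in (0.25, 0.5, 0.75):
        print(f"{frac:.0%} of capacitance at V2: "
              f"power ratio = {dual_vdd_power_ratio(frac):.3f}")
```

Under these assumptions, moving half of the switched capacitance to the lower supply removes roughly a quarter of the dynamic power.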
13
Three VDD’s
[Figure: power reduction ratio plotted over the lower supply voltages V2 (V) and V3 (V), axes from 0.4 to 1.4 V, for V1 = 1.5V, VTH = 0.3V, p(t): lambda]
From Kuroda
14
Optimum Numbers of Supplies
[Figure: three panels for { V1, V2 }, { V1, V2, V3 } and { V1, V2, V3, V4 }, each plotting the supply voltage ratios (V2/V1, V3/V1, V4/V1) and the power dissipation ratio (P2/P1, P3/P1, P4/P1) versus V1 from 0.5 to 1.5 V]
• The more VDD’s, the less power, but the effect saturates.
• Power reduction effect will be decreased as VDD’s are scaled.
• Optimum V2/V1 is around 0.7.
[Hamada, CICC’01]
15
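ALU Block Diagram
[Figure: ALU datapath with inputs ain, bin and ain0; blocks gp gen., carry gen., partial sum, sum sel., logical unit, 2:1 MUX, 5:1 MUX, two 9:1 MUXes and clock gen. (clk); signals carry, s0/s1 and sum; a long loop-back bus sumb (0.5pF) driven through INV1 and INV2. Blocks are marked as VDDH or VDDL circuits.]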
16
Low Swing Bus & Level Converter
[Figure: sum drives the sumb bus through INV1 and INV2, feeding a domino level converter (9:1 MUX) with keeper and precharge (pc), alongside ain0 and sel (VDDH); VDDH and VDDL domains are marked]
• Delay of INV1 does not increase
• INV2 is placed near 9:1 MUX to increase noise immunity
• Level conversion is done by a domino 9:1 MUX
17
Measured Results: Energy & Delay
[Figure: energy (pJ, 200 to 800) versus TCYCLE (ns, 0.6 to 1.6) at room temperature, single-supply versus shared well (VDDH = 1.8V); 1.16GHz point marked]
• VDDL = 1.4V: energy -25.3%, delay +2.8%
• VDDL = 1.2V: energy -33.3%, delay +8.3%
The dual-supply technique expands the power-delay optimization space
18
Distributing Multiple Supply Voltages
[Figure: a VDDH circuit (i1, o1) next to a VDDL circuit (i2, o2) with VDDH, VDDL and VSS rails, shown for the conventional arrangement and for the shared N-well arrangement]
19
Conventional
[Figure: VDDH circuit and VDDL circuit with VDDH, VDDL and VSS rails and N-well isolation]
(a) Dedicated row: VDDL rows and VDDH rows
(b) Dedicated region: VDDH region and VDDL region
20
Shared-Well
[Figure: VDDH circuit and VDDL circuit with VDDH, VDDL and VSS rails on a shared N-well]
(a) Floor plan image: VDDL circuits and VDDH circuits
21
Reducing the Supply Voltage: Concurrency versus Clock Speed
Example: reference datapath
22
Parallel Data Path
23
Pipelined Data Path
24
A Simple Data Path: Summary
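As a rough sketch of the trade-off named by the reference, parallel and pipelined datapath slides, the code below compares dynamic power P = C·V²·f for the three organizations. The capacitance overheads (2.15x and 1.15x) and the scaled supply (0.58 of the reference) are illustrative assumptions in the style of the familiar textbook datapath example, not values taken from these slides.

```python
def dynamic_power(cap, vdd, freq):
    """Dynamic power P = C * V^2 * f (all quantities normalized)."""
    return cap * vdd ** 2 * freq


# Normalized reference datapath: C = 1, VDD = 1, f = 1.
REFERENCE = dynamic_power(1.0, 1.0, 1.0)

# Assumed, illustrative numbers (not from the slides).
DESIGNS = {
    # Two copies at half the clock rate; extra routing and multiplexing.
    "parallel": dynamic_power(cap=2.15, vdd=0.58, freq=0.5),
    # Extra pipeline register; full clock rate with shorter logic depth.
    "pipelined": dynamic_power(cap=1.15, vdd=0.58, freq=1.0),
}

if __name__ == "__main__":
    for name, power in DESIGNS.items():
        print(f"{name}: {power / REFERENCE:.2f} of reference power")
```

In both cases the saving comes from the quadratic dependence on VDD: the added concurrency creates timing slack, and the slack is spent on a lower supply.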
25
Architecture Choices
[Figure: flexibility versus energy for embedded processor (lpArm; µP, program memory), DSP (e.g. TI 320CXX; MAC unit, address generator), embedded FPGA, reconfigurable processors (Maia) and direct-mapped hardware. Annotated efficiencies: .5-5 MIPS/mW, 10-100 MOPS/mW, 100-1000 MOPS/mW; overall spread a factor of 100-1000.]
Brodersen & Rabaey
26
Two Types of Processing
Fixed-rate processing (e.g. signal processing for multimedia or communications)
Stream-based computation
No advantage in obtaining throughput in excess of the real-time constraint
Variable-rate or burst-mode computation (e.g. general purpose computation)
mostly idle (or low-load) with bursts of computation
Faster is better
27
Variable-rate processing
Voltage as a design variable
[Figure: energy versus workload; annotation: VTmax/VDD]
Adapting voltage to workload yields cubic reduction!
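The cubic claim follows from P ∝ f·VDD² once VDD is lowered in proportion to the required clock rate. A minimal sketch, assuming the first-order model f ∝ VDD (this ignores the threshold voltage, so real savings at low voltage are smaller); the function names are illustrative.

```python
def relative_power_scaled_vdd(workload):
    """Power relative to full speed when VDD tracks the workload.

    workload: required throughput as a fraction of the maximum (0..1).
    Assumes f ~ workload and VDD ~ f, so P ~ f * VDD^2 ~ workload^3.
    """
    freq = workload
    vdd = workload  # first-order assumption: VDD proportional to f
    return freq * vdd ** 2


def relative_power_fixed_vdd(workload):
    """Same workload served by lowering fCLK only (VDD fixed): P ~ f."""
    return workload


if __name__ == "__main__":
    for w in (1.0, 0.5, 0.25):
        print(f"workload {w:.2f}: "
              f"scaled VDD -> {relative_power_scaled_vdd(w):.4f}, "
              f"fixed VDD -> {relative_power_fixed_vdd(w):.4f}")
```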
28
Common Design Approach: Fixed VDD
Compute ASAP: always high throughput, with excess throughput
Clock frequency reduction: fCLK reduced
Energy/operation remains unchanged… while throughput scales down with fCLK
[Figure: delivered throughput versus time for both approaches]
29
Dynamic Voltage Scaling (DVS)
Vary fCLK, VDD; dynamically adapt
[Figure: delivered throughput versus time]
• Dynamically scale energy/operation with throughput.
• Always minimize speed → minimize average energy/operation.
• Extend battery life up to 10x with the exact same hardware!
Burd, ISSCC’00
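A toy comparison of compute-ASAP, clock frequency reduction and DVS for a task that needs a given number of cycles before a deadline. It assumes energy per cycle ∝ VDD², a first-order f ∝ VDD relation, and no idle or transition energy; all names and numbers are illustrative.

```python
def energy_per_task(cycles, vdd):
    """Switching energy for a task: cycles * C * VDD^2 (C normalized to 1)."""
    return cycles * vdd ** 2


def compare(cycles, deadline, f_max=1.0, vdd_max=1.0):
    """Energy of three ways to run `cycles` cycles within `deadline`.

    Assumes f scales linearly with VDD (first-order model).
    """
    f_needed = cycles / deadline
    if f_needed > f_max:
        raise ValueError("deadline cannot be met even at full speed")
    return {
        # Run at full speed and full voltage, then idle.
        "compute_asap": energy_per_task(cycles, vdd_max),
        # Slow the clock only: energy per operation is unchanged.
        "fclk_reduction": energy_per_task(cycles, vdd_max),
        # Slow the clock and lower VDD to match.
        "dvs": energy_per_task(cycles, vdd_max * f_needed / f_max),
    }
```

With `cycles=100` and `deadline=400`, the DVS entry is 16 times lower than the other two in this idealized model; a real design saves less because the threshold voltage limits how far VDD can drop.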
30
Adaptive Supply Voltages
31
Variable Algorithmic Workload
32
Typical MPEG IDCT Histogram
33
Dynamic Power Reduction Through Software-Hardware Cooperation
[Figure: normalized power P ∝ ƒV² (0 to 1) versus required speed ∝ ƒ (0 to 1); the curve is super-linear]
[Figure: software passes the required speed to a hardware controller, which sets the clock & VDD of the processor]
If you don’t need to hustle, relax and save power.
S. Lee et al, DAC, June 2000
34
Processor: Converter Loop Sets VDD, fCLK
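[Figure: VBAT feeds a buck converter (PENAB/NENAB switches, inductor L, capacitor CDD) that supplies VDD and IDD to the processor. A ring oscillator running from VDD generates fCLK. A counter (RST) and latch clocked at f = 1MHz produce FMEAS, which is subtracted (Σ) from FDES, a 7-bit register value set by the O.S. (e.g. 0110100), to give FERR for the digital loop filter.]
• Feedback loop sets VDD so that FERR → 0.
• Ring oscillator delay-matched to CPU critical paths.
• Custom loop implementation → can optimize CDD.
Burd, ISSCC’00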
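The behaviour of this loop can be sketched with a simple discrete-time model: count ring oscillator cycles in a 1 µs window, compare with FDES, and nudge VDD until the error is zero. This is a minimal sketch, not the actual loop filter or buck converter; the alpha-power-law oscillator model and every constant below are assumptions chosen for illustration.

```python
VT = 0.8      # threshold voltage in volts (assumed)
VBAT = 4.0    # battery voltage, upper limit for VDD (assumed)
ALPHA = 1.5   # alpha-power-law exponent (assumed)
K = 60.0      # fitting constant, roughly 82 MHz at 3.8 V (assumed)
GAIN = 0.01   # integrator gain in volts per MHz of error (assumed)


def ring_osc_mhz(vdd):
    """Ring oscillator frequency in MHz, alpha-power-law model."""
    if vdd <= VT:
        return 0.0
    return K * (vdd - VT) ** ALPHA / vdd


def settle(f_des, vdd=1.0, steps=200):
    """Iterate the loop: measure, compare with FDES, adjust VDD."""
    for _ in range(steps):
        f_meas = int(ring_osc_mhz(vdd))  # counter: whole cycles per 1 us
        f_err = f_des - f_meas
        vdd += GAIN * f_err              # stand-in for filter + converter
        vdd = min(max(vdd, VT), VBAT)    # VDD cannot leave [VT, VBAT]
    return vdd


if __name__ == "__main__":
    for target in (12, 40, 80):
        v = settle(target)
        print(f"FDES = {target} MHz -> VDD = {v:.2f} V, "
              f"fCLK = {ring_osc_mhz(v):.1f} MHz")
```

Because the counter only resolves whole megahertz in this model, the loop stops once the measured count equals FDES, so the settled frequency lies within 1 MHz of the target.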
35
Measured System Performance & Energy
[Figure: Dhrystone 2.1 MIPS (0 to 100) versus energy (mW/MIPS, 0 to 6), static VDD versus dynamic VDD]
• 85 MIPS @ 5.6 mW/MIPS (3.8V)
• 6 MIPS @ 0.54 mW/MIPS (1.2V)
• Dynamic operation can increase energy efficiency > 10x.
Burd, ISSCC’00
36
DVS for Real Applications
[Figure: two traces of VDD (∝ fCLK), 1.0 to 3.5 V, 200ms/div. Compute ASAP: max. speed and idle. With voltage scheduler: low speed & idle, with increased speed for shorter process deadlines.]
• User-interface process: very bursty computation.
• High-latency computation done @ low speed/energy.
Burd, ISSCC’00
37
Measured Benchmark Energy Consumption
(Normalized Energy)

Algorithm       MPEG     UI       AUDIO
Compute ASAP    100 %    100 %    100 %
Optimal          67 %     25 %     16 %
ZERO             89 %     30 %     22 %

• ZERO is the implemented heuristic algorithm.
• Difficult to optimize compute-intensive code (MPEG).
• Big drop in energy when less speed required (3.3-4.5x).
Burd, ISSCC’00
38
Recent DVS-Enabled Microprocessors
Xscale: 180nm 1.8V bulk-CMOS [Intel00]
• 0.7-1.75V, 200-1000MHz, 55-1500mW (typ)
• Max. Energy Efficiency: ~23 MIPS/mW
PowerPC: 180nm 1.8V bulk-CMOS [Nowka02]
• 0.9-1.95V, 11-380MHz, 53-500mW (typ)
• Max. Energy Efficiency: ~11 MIPS/mW
Crusoe: 130nm 1.5V bulk-CMOS [Transmeta03]
• 0.8-1.3V, 300-1000MHz, 0.85-7.5W (peak)
Pentium M: 130nm 1.5V bulk-CMOS [Intel03]
• 0.95-1.5V, 600-1600MHz, 4.2-31W (peak)
39
VDD-Hopping
[Figure: normalized power (0 to 1) versus # of frequency levels (1, 2, 3, 8) for MPEG-4 encoding; transition time between ƒ levels = 200µs]
[Figure: timeline of slices #n and #n+1, marking where the n-th slice finished and the next milestone]
• Application slicing and software feedback guarantee real-time operation.
• Two hopping levels are sufficient.
40
Challenge: Design over Wide Range of Voltages
• Circuit design constraints. (Functional verification)
• Circuit delay variation. (Timing verification)
• Noise margin reduction. (Power grid, coupling)
• Delay sensitivity. (Local power distribution)
Design verification complexity similar to high-performance processor design @ fixed VDD
41
Relative Delay Variation
[Figure: percent delay variation (-20 to +40), delay relative to ring oscillator, versus VDD (VT to 4VT) for four extreme cases of critical paths: gate, interconnect, diffusion, series]
• All vary monotonically with VDD.
• Timing verification only needed at min. & max. VDD.
• Should also consider VDD variations
Burd, ISSCC’00
42
Delay Sensitivity
[Figure: normalized ∂Delay/Delay (0 to 1) versus VDD (VT to 4VT)]
\[ \frac{\partial Delay}{Delay} \approx \frac{\partial Delay}{\partial V_{DD}} \cdot \frac{\Delta V_{DD}}{Delay(V_{DD})}, \qquad \Delta V_{DD} = I_{DD}(V_{DD}) \cdot R \]
• Design of local power grid (for timing constraints) only needs to consider VDD ≈ 2VT.
Burd, ISSCC’00
43
Design for Dynamically Varying VDD
Circuits continue to properly operate as VDD changes:
• Static CMOS logic.
• Ring oscillator.
• Dynamic logic (& tri-state busses).
• Sense amp (& memory cell).
Max. allowed |dVDD/dt| → Min. CDD = 100nF (0.6µm)
44
Static CMOS Logic
[Figure: inverter with Vin = 0, so Vout = VDD; the load CL at Vout follows VDD through the PMOS rds, max. τ = 4ns]
• Static CMOS robustly operates with varying VDD.
0.6µm CMOS: |dVDD/dt| < 200V/µs
45
Ring Oscillator
• Output fCLK instantaneously adapts to new VDD.
[Figure: VDD and fCLK waveforms, 0 to 4 volts versus time from 60 to 260 ns, simulated with dVDD/dt = 20V/µs]
46
Dynamic Logic
[Figure: dynamic gate (VDD, clk, Vin, Vout) and waveforms of Vout and VDD versus time with clk = 1, for supply changes of +∆VDD and −∆VDD]
Errors:
• False logic low: ∆VDD > VTP
• Latch-up: ∆VDD > Vbe
• Cannot gate clock in evaluation state.
• Tri-state busses fail similarly → Use hold circuit.
0.6µm CMOS: |dVDD/dt| < 20V/µs
47
System-Level Issues: Reducing Waste
• Locality of reference
• Demand-driven / Data-driven computation
• Application-specific processing
• Preservation of data correlations
• Distributed processing
Avoid switching any capacitance unnecessarily
Sharing increases capacitance
48
Clock gating
Requires careful skew control ...
Fortunately well handled in today’s EDA tools
49
Clock-gating efficiently reduces power, NOW
MPEG4 decoder
[Figure: chip blocks DSP/HIF, DEU, MIF, VDE and 896Kb SRAM; bar chart of power (mW)]
• Without clock gating: 30.6mW
• With clock gating: 8.5mW
• 90% of F/F’s were clock-gated.
• 70% power reduction by clock-gating alone.
Courtesy M. Ohashi, Matsushita, ISSCC 2002, Paper #22.1
50
Pre-computation
Other options:
• guarded evaluation
• set output directly
Inputs xi … xn are not applied if pre-computing holds
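A classic instance of pre-computation is an n-bit comparator: when the most significant bits of the two operands differ, the result is already known and the remaining input bits need not be applied. The sketch below counts how often that happens for random inputs. The comparator example and all function names are assumptions made for illustration; the slide does not say which circuit it shows.

```python
import random


def compare_with_precompute(a, b, width):
    """Return (a > b, low_bits_loaded) for unsigned `width`-bit operands.

    If the MSBs differ, the comparison is decided by them alone and the
    lower width-1 input bits would not be applied to the main logic.
    """
    msb = width - 1
    a_msb = (a >> msb) & 1
    b_msb = (b >> msb) & 1
    if a_msb != b_msb:
        return a_msb > b_msb, False       # pre-computed; inputs held
    mask = (1 << msb) - 1
    return (a & mask) > (b & mask), True  # full evaluation needed


def fraction_precomputed(width=8, trials=10_000, seed=0):
    """Estimate how often the lower bits stay disabled for random inputs."""
    rng = random.Random(seed)
    skipped = 0
    for _ in range(trials):
        a = rng.randrange(1 << width)
        b = rng.randrange(1 << width)
        result, loaded = compare_with_precompute(a, b, width)
        assert result == (a > b)          # must match the plain comparator
        skipped += not loaded
    return skipped / trials
```

For uniformly random operands the MSBs differ about half of the time, so roughly half of the evaluations avoid switching the lower input bits.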
51
Circuit-level Activity Reduction
52
Circuit-Level Activity Encoding
Conditional Inversion Coding for Interconnect
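One common form of conditional inversion coding is bus-invert coding: if sending the next word would toggle more than half of the bus lines, send its complement instead and assert an extra invert line. A minimal sketch follows; the bus width, the starting bus state and the function names are assumptions made for illustration.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def bus_invert_encode(words, width=8):
    """Encode words for a `width`-bit bus with one extra invert line.

    Returns a list of (bus_value, invert_flag) pairs. A word is sent
    inverted when sending it as is would toggle more than half the lines.
    """
    mask = (1 << width) - 1
    prev = 0                       # assume the bus starts at all zeros
    encoded = []
    for word in words:
        word &= mask
        if popcount(prev ^ word) > width // 2:
            word, flag = word ^ mask, 1
        else:
            flag = 0
        encoded.append((word, flag))
        prev = word
    return encoded


def bus_invert_decode(encoded, width=8):
    """Recover the original words from (bus_value, invert_flag) pairs."""
    mask = (1 << width) - 1
    return [value ^ mask if flag else value for value, flag in encoded]


def data_line_toggles(values):
    """Total bit transitions on the data lines, starting from all zeros.

    Transitions of the invert line itself are not counted here.
    """
    prev, total = 0, 0
    for value in values:
        total += popcount(prev ^ value)
        prev = value
    return total
```

The scheme bounds the number of data-line transitions per transfer to half the bus width, at the cost of one extra wire and the encode/decode logic.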
53
Eliminating Redundant Computations
54
Eliminating Redundant Computations
55
Number Representation
56
Number Representation - Accumulator Example
57
Two’s Complement vs Sign-Magnitude
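For signals that hover around zero, two's complement flips all the sign-extension bits at every sign change, while sign-magnitude flips only the sign bit plus whatever changes in the magnitude. The sketch below counts bit transitions for a sample sequence in both encodings; the 8-bit width and the sequence are illustrative assumptions.

```python
def twos_complement(value, width=8):
    """Two's-complement bit pattern of a signed integer."""
    return value & ((1 << width) - 1)


def sign_magnitude(value, width=8):
    """Sign-magnitude bit pattern: MSB is the sign, the rest is |value|.

    Assumes |value| fits in width-1 bits.
    """
    sign = 1 if value < 0 else 0
    return (sign << (width - 1)) | abs(value)


def transitions(patterns):
    """Total number of bit flips between consecutive patterns."""
    return sum(bin(a ^ b).count("1") for a, b in zip(patterns, patterns[1:]))


if __name__ == "__main__":
    samples = [1, -1, 2, -2, 1, -1]
    tc = transitions([twos_complement(s) for s in samples])
    sm = transitions([sign_magnitude(s) for s in samples])
    print(f"two's complement: {tc} flips, sign-magnitude: {sm} flips")
```

The gap narrows for signals that rarely change sign or that use most of the dynamic range, so the better choice depends on the signal statistics.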
58
Reducing Activity by Reordering Inputs
59
Resource Sharing Can Increase Activity
60
Application-Specific Processing
[Figure: datapath built from two adders (Add) and a register]
Application Specific Processing Reduces “Implementation Overhead”
61
The Architectural Trade-off
                         64-point FFT                       16-State Viterbi Decoder
                         Energy per      Transforms per     Energy per       Decode rate
                         Transform       unit area          Decoded bit      per unit area
                         (nJ)            (Trans/ms/mm2)     (nJ)             (kb/s/mm2)
High-Performance DSP     1700            10                 108              150
Low-Power DSP            436             4.3                19.6             50
FPGA                     683             1.8                5.5              100
Direct-Mapped Hardware   1.78            2,200              0.022            200,000

(numbers taken from vendor-published benchmarks)
Orders of magnitude lower efficiency even for an optimized processor architecture
62
Towards Heterogeneous Architectures for SOC
Xilinx Virtex-II Pro
Janus Chip - ST Micro and Parades
Berkeley Pleiades
63
Reducing Active Dissipation: Summary
• Voltage as a Design Variable
  Match voltage and frequency to required performance
• Minimize waste (or reduce switching capacitance)
  Match computation and architecture
  Preserve locality inherent in algorithm
  Exploit signal statistics
  Energy (performance) on demand
More easily accomplished in application-specific than programmable devices