© 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A...
Power-Aware Design for High-Performance Processors
A Tutorial at HPCA-2004
Kevin Skadron, University of Virginia
Jose Gonzalez, Intel Labs Barcelona
Roadmap
  • Introduction & Trends
  • Dynamic Power Dissipation
    • Sources, modeling, reduction techniques
  • Static Power Dissipation
    • Sources, modeling, reduction techniques
  • Summary
Introduction
  • Power: work done per unit time (watts)
  • Energy: total work (joules)
  • Why is power a concern in current processors?
    • Increased market demand for battery-powered consumer electronics; battery life is a selling point
    • Electricity and cooling costs for large data centers are becoming substantial
      • 5-25% of data center income (cf. Rajamony & Bianchini tutorial, ICS'02)
    • Government energy-efficiency requirements
      • (e.g., Energy Star in the US)
    • Electricity costs for large ISPs are becoming substantial
    • Packaging and cooling costs (due to the increase in power density) are becoming prohibitive
    • Power dissipation may reach technology limits
    • Current delivery is becoming expensive
Metrics
  Some different power metrics & fallacies:
  • Reducing power does not always save energy
    • Energy = $\int P \, dt$
    • If you reduce power but increase execution time, energy may go up
  • Also note that reducing power does not always reduce temperature
  • Sustained power density limits thermal design/packaging
    • Approximately the same as thermal design power
    • Note that on-chip temperatures and total heat production are somewhat different concerns
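As a rough illustration of the first fallacy, the sketch below compares two hypothetical runs of the same task. The wattages and run times are made-up numbers, and power is assumed constant over each run so that the integral reduces to P × t.

```python
def energy_joules(power_watts: float, time_seconds: float) -> float:
    """Energy of a constant-power run: E = P * t (assumes P does not vary)."""
    return power_watts * time_seconds


# Hypothetical baseline: 50 W for 10 s.
baseline = energy_joules(power_watts=50.0, time_seconds=10.0)  # 500 J

# Hypothetical "low-power" mode: 20% less power, but the task takes 40% longer.
low_power = energy_joules(power_watts=40.0, time_seconds=14.0)  # 560 J

# Power went down, yet the task consumed more energy.
energy_went_up = low_power > baseline
```

The low-power run only saves energy if the slowdown factor is smaller than the power reduction factor.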
Metrics
  • Power
    • Average power
    • Power density map
  • Energy
    • Energy ($\mathrm{MIPS}/\mathrm{W}$)
    • Energy-Delay product ($\mathrm{MIPS}^2/\mathrm{W}$)
    • Energy-Delay$^2$ product ($\mathrm{MIPS}^3/\mathrm{W}$): voltage independent!
  • Temperature
    • Average temperature
    • Peak temperature
    • Temperature map
      • Does not necessarily match power density map
    • No good figures of merit for trading off thermal efficiency against performance, area, or energy efficiency
  (Zyuban, GVLSI'02)
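One way to see why MIPS^3/W is called voltage independent is a first-order sketch. It assumes that frequency (and hence MIPS) scales linearly with Vdd and that power scales as Vdd^2 × f, i.e. Vdd^3. Both are simplifying assumptions, and the 1 V reference values below are hypothetical.

```python
def metrics(vdd: float, mips_at_1v: float = 1000.0, watts_at_1v: float = 50.0) -> dict:
    """First-order model: MIPS ~ f ~ Vdd and P ~ Vdd^2 * f ~ Vdd^3 (assumed)."""
    mips = mips_at_1v * vdd
    watts = watts_at_1v * vdd ** 3
    return {
        "MIPS/W": mips / watts,
        "MIPS^2/W": mips ** 2 / watts,
        "MIPS^3/W": mips ** 3 / watts,
    }


low = metrics(0.8)
high = metrics(1.2)

# MIPS/W and MIPS^2/W favor the lower voltage; MIPS^3/W does not change.
for name in ("MIPS/W", "MIPS^2/W", "MIPS^3/W"):
    print(name, low[name], high[name])
```

Under these assumptions, MIPS/W can be improved simply by lowering the voltage, whereas MIPS^3/W only changes when the design itself changes.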
Power Dissipation
  • Dynamic power dissipation
    • Due to switching activity
  • Static power dissipation
    • Due to leakage current; major paths are:
      • Subthreshold leakage: exponentially dependent on Vdd, Vth, temperature
      • Gate leakage: exponentially dependent on Vdd, Tox
Power Dissipation
  • Total power actually consists of
    • Switching power
    • Short-circuit power
    • Leakage power
Big Picture - Trends
  • Data on current power dissipation for various chips
  • Distribution of power within a typical processor
  • Scaling trends in power dissipation
  • Trends in leakage power
  • Trends in battery life
Power Dissipation

  Processor          Clock Rate   Power (Max)
  Alpha 21364        1.15 GHz     110 W
  AMD Opteron        2.2 GHz      86 W
  HP-PA8700          870 MHz      75 W
  IBM-Power 4        1.7 GHz      100 W
  Intel Itanium 2    1.5 GHz      130 W
  Intel Xeon         3.2 GHz      86 W
  MIPS R14000        600 MHz      16 W

  Source: Microprocessor Report
Power Dissipation Breakdown
  Alpha 21264
  [Figure: breakdown of power dissipation by unit: global clock network, instruction issue units, caches, FP execution units, int. execution units, mem. management unit, I/O, miscellaneous]
  Source: Gowan et al., "Power Considerations in the Design of the Alpha 21264 Microprocessor", DAC 1998
Effects of Technology Scaling on Power Dissipation
  Per technology generation:
  • Feature size is scaling down: 30%
  • Frequency is increasing: ~2x (ideal scaling: gate delay decreases by 30%)
  • Area increases due to microarchitecture improvements: 25% (ideal scaling: decreases by 50%)
  • Active capacitance increases: at least 30% (ideal scaling: decreases by 30%)
  • Vdd is not scaled down at the same rate as feature size: 0-10% (ideal scaling: 30%)
  • Ideal scaling: $P \propto C V^2 f \rightarrow 0.7^2 \approx 0.5$, i.e. power is roughly halved
  • Observed scaling: 2-2.5x increase
  • Power density becomes a problem!
    • Especially since the power density is non-uniform
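The ideal and observed figures above can be checked with a few lines of arithmetic. This is only a sketch: it plugs the per-generation scale factors quoted on the slide into P ∝ C·V²·f, and the helper name `power_scale` is made up for illustration.

```python
def power_scale(cap: float, vdd: float, freq: float) -> float:
    """Relative change in P ~ C * V^2 * f for per-generation scale factors."""
    return cap * vdd ** 2 * freq


# Ideal scaling: C and Vdd shrink by 30%, gate delay shrinks by 30% (f grows by 1/0.7).
ideal = power_scale(cap=0.7, vdd=0.7, freq=1 / 0.7)

# Observed scaling: C grows by 30%, Vdd drops by only 0-10%, f doubles.
observed_low = power_scale(cap=1.3, vdd=0.9, freq=2.0)
observed_high = power_scale(cap=1.3, vdd=1.0, freq=2.0)

print(f"ideal: {ideal:.2f}x, observed: {observed_low:.2f}x to {observed_high:.2f}x")
```

The ideal case lands at about 0.49x, and the observed case at roughly 2.1x to 2.6x, which is consistent with the 2-2.5x increase quoted above.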
Power Evolution
  [Figure: max power (watts), log scale from 1 to 100, by processor: i386, i486, Pentium®, Pentium® w/MMX tech., Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4, followed by "?"]
  Source: Intel
Trends in Power Density
  [Figure: power density (watts/cm^2), log scale from 1 to 1000, for i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4, with reference levels labeled hot plate, nuclear reactor, and rocket nozzle]
  * "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies", Fred Pollack, Intel Corp., Micro32 conference keynote, 1999.
ITRS Projections

  Year                          2003   2006   2010   2013   2016
  Tech node (nm)                 100     70     45     32     22
  Vdd (high perf) (V)            1.0    0.9    0.6    0.5    0.4
  Vdd (low power) (V)            1.1    1.0    0.8    0.7    0.6
  Frequency (high perf) (GHz)    3.1    5.6   11.5   19.3   28.8
  Max power (W)
    High-perf w/ heatsink        160    180    218    251    288
    Cost-performance              85     98    120    138    158
    Hand-held                    3.2    3.5    3.0    3.0    3.0

  Source: ITRS 2001

  • These are targets
  • Based on historical trends, the high-performance power targets seem optimistic
  • Intel papers suggest that in the 45-75 W range, cooling costs $1/W; but then the rate of increase goes up: $2/W, $3/W, maybe more! (Borkar, IEEE Micro '99; Gunther et al, ITJ '01)
Leakage Power
  • The fraction of leakage power is increasing exponentially with each generation
  • Also exponentially dependent on temperature
  [Figure: static power / dynamic power (percentage, 0 to 70) versus temperature (K) for the 180nm, 130nm, 100nm, 90nm, 80nm, and 70nm nodes; the ratio increases across generations]
  Source: Skadron et al, University of Virginia
Trends in Battery Technology
  • Battery lifetime is increasing perhaps 8-10%/yr. (Powers, Proc. of IEEE 1995)
  • Not keeping up with rate of growth in energy consumption
  Source: Rabaey 1995, cited in Irwin et al, "Low Power Design Methodologies, Hardware and Software Issues", tutorial at PACT 2000
Roadmap
  • Introduction & Trends
  • Dynamic Power Dissipation
    • Sources, modeling, reduction techniques
  • Static Power Dissipation
    • Sources, modeling, reduction techniques
  • Summary
Dynamic Power Dissipation
  Roadmap
  • Sources of dynamic power dissipation
  • Modeling dynamic power
  • Circuit- and architecture-domain techniques to reduce power
Dynamic Power Consumption
  • Power dissipated due to switching activity
  • A capacitance is charged and discharged
  • Charge/discharge at the frequency f: $P = C_L V^2 f$
    • Charging: $E_c = \frac{1}{2} C_L V^2$
    • Discharging: $E_d = \frac{1}{2} C_L V^2$
  • Note that the energy consumed from the battery is $C_L V^2$ and is drawn upon charging
  [Figure: load capacitance connected to Vdd, with 0 -> 1 and 1 -> 0 transitions]
Dynamic Power Dissipation
  Equation: $P = a \, C_L \, V_{dd}^2 \, f$
  • a: activity factor
    • Depends on the processor architecture
  • $C_L$: capacitance of the circuit
    • Depends on the design style, number of transistors, transistor sizing, etc.
  • $V_{dd}$: operating voltage
  • f: frequency
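A minimal sketch of the equation above as a function. The capacitance, voltage, and frequency values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def dynamic_power(activity: float, cap_farads: float, vdd: float, freq_hz: float) -> float:
    """Dynamic power P = a * C_L * Vdd^2 * f, in watts."""
    return activity * cap_farads * vdd ** 2 * freq_hz


# Hypothetical chip: 20 nF of switched capacitance, a = 0.5, 1.5 V, 1 GHz.
p_nominal = dynamic_power(activity=0.5, cap_farads=20e-9, vdd=1.5, freq_hz=1.0e9)

# Same chip with voltage and frequency both scaled to 80% of nominal.
p_scaled = dynamic_power(activity=0.5, cap_farads=20e-9, vdd=1.2, freq_hz=0.8e9)

print(f"nominal: {p_nominal:.2f} W, scaled: {p_scaled:.2f} W")
```

Because voltage enters quadratically and frequency linearly, scaling both to 80% cuts dynamic power to about half (0.8^3 ≈ 0.51) in this model.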
Dynamic Power Modelling
  • $P = a \, C_L \, V^2 \, f$
  • Information needed
    • Activity counters in each unit
    • Energy dissipated per access
  • For precision, "a" (# of signal transitions) should be measured or at least estimated with a probabilistic model
  • More commonly, a = 0.5 is assumed
  [Figure: block diagram with configuration, performance model, activity, power model, performance metrics, and power metrics]
Dynamic Power Modelling
  • Activity counters
    • Performance model is used
    • Counters for: cache access, FU usage, register file, ...
  • Energy per access
    • Analytically: calculating capacitances as a function of size, ports, etc.
      • Example: cache access: decoder, precharge transistors, bitline, cell access, wordline, sense amplifiers, ...
      • Wattch (Brooks et al, ISCA 2000)
      • Cacti
    • Empirically: using low-level designs and applying "virus" tests
      • Virus test: microbenchmark that stresses a particular unit
      • ALPS (Gunther et al, ITJ, 2001)
    • Circuit-extracted model
      • PowerTimer: IBM Power4 (Brooks et al, PACS'00)
      • AccuPower: parameterized, based on SPICE measurements of actual layouts (SUNY Binghamton, Ponomarev et al, DATE'02)
      • PowerAnalyzer: StrongARM (Michigan, assoc. w/ SimpleScalar)
  • Many of these ignore the actual number of signal transitions
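The counter-based approach can be sketched as follows. This is not the interface of Wattch or any other tool named above; the unit names, per-access energies, and counter values are all hypothetical, and a fixed energy per access is assumed (which, as noted, ignores the actual number of signal transitions).

```python
# Hypothetical energy per access for each unit, in nanojoules.
ENERGY_PER_ACCESS_NJ = {
    "icache": 0.9,
    "dcache": 1.1,
    "regfile": 0.4,
    "int_alu": 0.3,
}


def average_power_watts(counters: dict, cycles: int, freq_hz: float,
                        energy_nj: dict = ENERGY_PER_ACCESS_NJ) -> float:
    """Average power = sum(accesses * energy per access) / execution time."""
    energy_joules = sum(counters[unit] * energy_nj[unit] * 1e-9 for unit in counters)
    exec_time_seconds = cycles / freq_hz
    return energy_joules / exec_time_seconds


# Activity counters as a performance simulator might report them (made up).
counters = {"icache": 1_000_000, "dcache": 400_000,
            "regfile": 2_000_000, "int_alu": 800_000}

power = average_power_watts(counters, cycles=1_000_000, freq_hz=1.0e9)
print(f"average power: {power:.2f} W")
```

The performance model supplies the counters and cycle count; the power model supplies the per-access energies, whether derived analytically, empirically, or from extracted circuits.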
Circuit-Level Techniques
  • Transistor sizing
  • Signal and clock gating
  • Circuit restructuring
  • Low power caches
  • Low power register files
  • Issue queue
  These typically reduce the capacitance being switched
Transistor Sizing
  • Transistor sizing plays an important role in reducing power
  • Delay ~ $K / \ln K$
  • Power ~ $K / (K - 1)$
  • Optimum K for both power and delay must be pursued
  [Figure: chain of stages with capacitances $C_0, C_1, \ldots, C_{N-1}, C_N$, where $K = C_i / C_{i-1}$]
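To see why a compromise K is needed, the two proportionalities above can simply be tabulated. The sketch below evaluates them over a hypothetical range of stage ratios; the constants of proportionality are dropped, so only the trends are meaningful.

```python
import math


def relative_delay(k: float) -> float:
    """Delay ~ K / ln K (constant factor omitted)."""
    return k / math.log(k)


def relative_power(k: float) -> float:
    """Power ~ K / (K - 1) (constant factor omitted)."""
    return k / (k - 1)


# Candidate stage ratios from 1.5 to 10.0 in steps of 0.1 (arbitrary range).
candidates = [1.5 + 0.1 * i for i in range(86)]
best_delay_k = min(candidates, key=relative_delay)

for k in (2.0, best_delay_k, 4.0, 8.0):
    print(f"K={k:.1f}  delay~{relative_delay(k):.3f}  power~{relative_power(k):.3f}")
```

Delay is minimized near K = e, while the power term keeps falling as K grows, so the chosen K trades a small delay penalty for lower power.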
Signal Gating
  • Implementation
    • Simple gate
    • Tristate buffer
    • ...
  • Control signal needed
    • Generation requires additional logic
  • Identification of signals to be gated
    • Clock
    • Address bus
  • Also helps to prevent power dissipation due to glitches
  "Techniques to mask unwanted switching activities from propagating forward, causing unnecessary power dissipation"
  [Figure: gate with inputs "signal" and "ctrl" producing "Output"]
Clock Gating
  • Implementation
    • Simple gate that replaces one buffer in the clock tree
    • Delay is generally not a concern
  • Decision
    • Architectural level
  "Disabling a functional block when it is not required for an extended period"
  [Figure: gate with inputs "signal" and "ctrl" driving the functional units]
Circuit Restructuring
  • Pipeline (can reduce frequency)
  • Parallelize (can reduce frequency)
  • Reorder inputs so that the most active input is closest to the output (reduces switched capacitance)
  • Restructure gates (equivalent functions are not equivalent in switched capacitance)
  • Energy-efficient flip-flops and latches
Cache Design
  • $C_{access} = R \cdot C \cdot C_{cell}$ (R rows, C cols)
  • Reducing power
    • Switched capacitance
    • Voltage swing
    • Activity factor
    • Frequency
  [Figure: cache array with row decoder, wordline, bitlines, column decoder, and sense amp; R rows, C cols]
  [Figure: bar chart, axis from 0 to 80, with Read and Write series for Decoder, Wlines, TBLSA, DBLSA, I/O buses, and Other. TBLSA: tag bitlines and sense amp. DBLSA: data bitlines and sense amp. Cache parameters: 16 KB cache, 0.25 μm]
  Villa et al, MICRO 2000
Cache Design
  • Banked organization
    • Targets switched capacitance
    • $C_{access} = R \cdot C \cdot C_{cell} / B$
  • Dividing word line
    • Same effect for wordlines
  • Reducing voltage swings
    • Sense amplifiers used to detect $V_{diff}$ across bitlines
    • Read operation can be curtailed as soon as $V_{diff}$ is detected
    • Limiting voltage swing saves a fraction of power
  • Pulse word lines
    • Enabling the word line only for the time needed to discharge the bitcell voltage
    • Designer needs to estimate access time and implement a pulse generator
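The effect of banking on switched capacitance follows directly from the formula above. In this sketch the array dimensions correspond to a 16 KB array, and the per-cell capacitance is a hypothetical value; the function name is made up for illustration.

```python
def access_capacitance(rows: int, cols: int, c_cell_farads: float, banks: int = 1) -> float:
    """Switched capacitance per access: C_access = R * C * C_cell / B."""
    return rows * cols * c_cell_farads / banks


# 16 KB array = 131072 bit cells, organized here as 256 rows x 512 columns.
# The 2 fF per-cell capacitance is an assumed, illustrative number.
unbanked = access_capacitance(rows=256, cols=512, c_cell_farads=2e-15)
four_banks = access_capacitance(rows=256, cols=512, c_cell_farads=2e-15, banks=4)

print(f"unbanked: {unbanked * 1e12:.1f} pF, 4 banks: {four_banks * 1e12:.1f} pF")
```

Only the accessed bank switches, so in this first-order model the capacitance per access drops by the number of banks; bank-selection logic and routing overheads are not modeled.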
Low Power Register File Design
  • RFs usually use single-ended bitlines
  • Modified storage cell
    • Lots of zeros are fetched from the RF
    • Bitline connections are modified to eliminate bitline discharge when reading a zero
  Tseng and Asanovic, ICSD, 2000; Zyuban and Kogge, ISLPED 1998
Efficient Issue Queue
  • Constitutes a high fraction of the overall power
    • >25% according to some authors
  [Figure: issue queue entry with broadcast tags (Tag 1 ... Tag w), comparators (comp) combined through OR gates, and a ready bit (RDY) per source operand (Oprnd)]
Efficient Issue Queue
  • Useful comparisons
    • Empty entries and ready entries consume energy
      • Wakeup of empty entries can be disabled: gating off precharge logic using the valid bit
      • Wakeup of ready sources can be disabled: gating off precharge logic using the ready bit
    • Folegnani and Gonzalez, ISCA 2001
  • Energy-efficient comparators
    • Traditional comparators dissipate energy on a mismatch in any bit position
    • 10%-20% of source operands match each cycle
    • Solution: comparators that dissipate energy on a match
    • Kucuk et al, ISLPED 2001
Architectural-Level Techniques
  • Encoding/compression
  • Energy-efficient front end
  • Energy-efficient caches
  • Asymmetric processors
  • Dynamic voltage/frequency scaling
  • Multi clock domain architectures (similar to GALS)
  • Pipeline gating
  • Compiler techniques
  • Sleep modes
  These typically take advantage of locality or slack
Bus Invert Encoding
  • Reduce power of parallel synchronous signals
  • Idea: minimize the number of transitions
    • (Stan & Burleson, IEEE Trans. on VLSI, 1995)
  • Sender examines the current and the next values
  • Decides whether to send the true or the complement signal
  • Additional polarity signal is sent along with the data
  • Example

      Current data       110011101
      Next data          000100110    Number of transitions: 7

      Current data       110011101
      NOT (Next data)    111011001    Number of transitions: 2
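The sender's decision can be sketched in a few lines. This is a simplified software model, not the circuit from the cited paper: the function names are made up, and the transition on the polarity line itself is not counted.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert(current: int, nxt: int, width: int):
    """Return (word_to_drive, invert_flag) for the next bus cycle.

    `current` is the value presently on the bus wires. The complement of
    `nxt` is sent when more than half of the wires would otherwise toggle.
    """
    mask = (1 << width) - 1
    if hamming(current, nxt) > width / 2:
        return (~nxt) & mask, True
    return nxt & mask, False


def decode(word: int, inverted: bool, width: int) -> int:
    """Receiver side: undo the inversion using the polarity signal."""
    mask = (1 << width) - 1
    return (~word) & mask if inverted else word


# The 9-bit values from the example on this slide.
current = 0b110011101
nxt = 0b000100110
word, inverted = bus_invert(current, nxt, width=9)

print(f"{word:09b}", inverted, hamming(current, word))
```

With this rule the data wires never see more than half of their bits toggle in one cycle, at the cost of one extra wire.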
Dynamic Zero Compression
  • Zero Indicator Bit (ZIB) added to each byte
    • Enabled if a zero is stored in the cache
    • On a read access, bitline discharge is prevented by disabling the local wordline
    • On a write, if the byte is zero, just the ZIB is written
  • Circuit modifications
    • Zero-detection and store bus drivers
    • Wordline gating: 8-bit data is driven by the associated ZIB
    • Sense amps: modified to drive a zero if the ZIB is active
  • Drawbacks
    • 9% area increase, 2-gate delay increase
  • Results
    • 26% energy reduction in the data cache, 10% in the instruction cache
  Villa et al, MICRO 2000
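A toy count of how many bit cells are touched per access gives a feel for where the savings come from. This is an assumed simplification, not the energy model from the cited paper: it counts one ZIB per byte, plus the 8 data bits only when the byte is nonzero, and the sample cache line is made up.

```python
def bits_accessed(line: bytes, use_zib: bool) -> int:
    """Count bit cells touched when reading or writing `line`.

    Without ZIBs every byte costs 8 bits. With ZIBs a zero byte costs
    only its indicator bit, and a nonzero byte costs 8 data bits + 1 ZIB.
    """
    if not use_zib:
        return 8 * len(line)
    return sum(1 if byte == 0 else 9 for byte in line)


# Hypothetical 8-byte line in which most bytes are zero.
line = bytes([0x00, 0x00, 0x00, 0x2A, 0x00, 0x00, 0x01, 0xFF])

baseline = bits_accessed(line, use_zib=False)
with_zib = bits_accessed(line, use_zib=True)

print(baseline, with_zib)
```

The scheme pays one extra bit on every nonzero byte, so it only helps when zero bytes are common.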
Exploiting Narrow Width Operands
  • A high percentage of integer operations require <16 bits
  • Difficult for the compiler to know the actual operand size
    • Variability for the same instruction in successive instances
  • Clock gating is used to partially disable the FU
  [Figure: datapath with 64-bit Operand A and Operand B, each with a low latch (bits 0-15) and a high latch (bits 16-63) whose clock is ANDed with a zero48 signal from zero detection, feeding the integer FU and its 64-bit result]
  Brooks and Martonosi, HPCA 1999
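The run-time check behind this technique can be sketched as below. It is a software model only: `zero48` mirrors the signal name in the figure, the other names and the operand trace are hypothetical, and only non-negative 64-bit values are considered.

```python
UPPER48_MASK = 0xFFFF_FFFF_FFFF_0000  # bits 16-63 of a 64-bit operand


def zero48(operand: int) -> bool:
    """True when the upper 48 bits of a 64-bit operand are all zero."""
    return (operand & UPPER48_MASK) == 0


def can_gate_upper(a: int, b: int) -> bool:
    """The high latches can stay unclocked only if both operands are narrow."""
    return zero48(a) and zero48(b)


def gated_fraction(operand_pairs) -> float:
    """Fraction of operations for which the upper 48 bits could be gated."""
    pairs = list(operand_pairs)
    gated = sum(1 for a, b in pairs if can_gate_upper(a, b))
    return gated / len(pairs)


# Made-up operand trace: two narrow operations, two wide ones.
trace = [(5, 1000), (0xFFFF, 1), (1 << 16, 3), (2 ** 40, 2 ** 40)]
fraction = gated_fraction(trace)

print(fraction)
```

Because the decision is made per dynamic instance from the operand values, it does not rely on the compiler knowing operand widths.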
Energy-Efficient Front End: Branch Prediction
  • Parikh et al, HPCA'02, IEEE Trans. Computers '04
  • Branch prediction accuracy is a major determinant of pipeline activity -> spending more power in the branch predictor can be worthwhile if it improves accuracy
  • Branch predictors can be designed to reduce power, e.g.
    • Banking
    • Gate off unnecessary accesses ("prediction probe detector")
Energy-Efficient Front End: Register Renaming
  • RAT often implemented as a multiported register file indexed by logical register; returns physical register
    • Liu and Lu, MICRO'00
      • Hierarchical RAT: top level is a cache of the full table
    • Kucuk et al, PATMOS'03
      • Prevent lookup of sources that will be supplied by a freshly renamed instruction in the same rename group
      • Filter cache
  • Could instead organize as an associative lookup in a table organized by physical register, with dissipate-on-match comparators (Ergin et al, ICCD'02)
Energy-Efficient Caches
  • Filter cache
    • Small L0 cache filters many accesses to L1, allows an L1 with fewer ports (Kin et al, MICRO-30)
  • Banks
  • Selective cache ways (Albonesi, MICRO-32)
    • Ways in a set associative cache can be disabled if not needed
    • Many variations of this approach
  • Staggering number of papers on this topic
    • Exploit victim cache, load-store queue
    • Clever cache organizations (e.g. combining banks w/ high assoc, specialized caches, etc.)
    • See recent proceedings of VLSI, architecture conferences, esp. ISLPED
Asymmetric Processors
  • Processors have different "versions" of the same resource, with different power/latency
    • Fast, power-hungry resources are allocated to critical instructions
    • Slow, low-power resources are allocated to non-critical instructions
  • Criticality predictor is needed!
Asymmetric Processors
  • Reducing power of functional units
    • 2 sets of functional units
    • 2 sets of instruction queues
    • Criticality predictor
  • Critical instructions
    • In-order queue: critical path is usually a serial chain of dependent instructions
    • Fast functional units
  • Non-critical instructions
    • OoO queue
    • Slow functional units
  Seng et al, MICRO 2001
Dual Speed Pipelines
  • Slow pipeline works at half the frequency
  • Criticality predictor is a key component for maintaining energy efficiency
  • No communication penalties
  [Figure: Fetch and Decode stages feeding a fast pipeline and a slow pipeline, with a criticality predictor, followed by RegFile and Commit]
  Pyreddy and Tyson, WCED 2001
Dynamic Voltage/Frequency Scaling
  • Allow the device to dynamically adapt the voltage (and the frequency)
    • $P \sim V_{dd}^2$
    • Delay $\sim V_{dd} / (V_{dd} - V_{th})^k$, so $f \sim (V_{dd} - V_{th})^k / V_{dd}$
  • Tradeoff between power reduction and delay increase
    • MUST BE energy-efficient
  • Already implemented in many processors
  • Implementation
    • Voltage regulator
    • Predict future processor utilization and adjust frequency/voltage to maximize power reduction while maintaining performance
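The tradeoff can be made concrete with a small model. The sketch assumes the common alpha-power form in which delay grows as Vdd / (Vdd − Vth)^k, adds the frequency factor to the power relation (P ~ Vdd² · f), and ignores leakage; the threshold voltage, exponent, and nominal voltage are hypothetical values.

```python
def relative_frequency(vdd: float, vth: float = 0.35, k: float = 1.5) -> float:
    """f ~ (Vdd - Vth)^k / Vdd, the inverse of the gate delay (assumed model)."""
    return (vdd - vth) ** k / vdd


def task_cost(vdd: float, v_nominal: float = 1.5, vth: float = 0.35, k: float = 1.5):
    """Return (time, power, energy) for a fixed task, relative to nominal Vdd."""
    f_rel = relative_frequency(vdd, vth, k) / relative_frequency(v_nominal, vth, k)
    time_rel = 1.0 / f_rel                       # same cycle count, slower clock
    power_rel = (vdd / v_nominal) ** 2 * f_rel   # dynamic power ~ Vdd^2 * f
    return time_rel, power_rel, time_rel * power_rel


time_rel, power_rel, energy_rel = task_cost(vdd=1.2)
print(f"time x{time_rel:.2f}, power x{power_rel:.2f}, energy x{energy_rel:.2f}")
```

In this model the energy for a fixed task depends only on Vdd², so the energy saving is bought entirely with longer execution time, which is why the scaling decision has to account for performance demands.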
Transmeta™ LongRun™
  • Crusoe processor can configure itself*
    • Voltage changes in steps of 25 mV (depending on the voltage regulator)
    • Frequency changes in steps of 33 MHz
    • From 1.6 V, 600 MHz to 1.2 V, 300 MHz (2001)
  • Management
    • Implemented in the Code Morphing™ software layer
    • Idle time of the system is sampled to determine performance demands
  • Thermal extension
    • May be a form of thermal throttling
    • Expands the thermal budget of the processor
  * Source: http://www.transmeta.com
Transmeta™ LongRun™
  • Idle time
    • Voltage drops to minimum
  • On-line activity
    • Voltage rises to maximum
  • Real-time activity
    • Voltage adjusted to meet requirements
    • DVD player
      • 24 frames/second
  Source: Transmeta
Intel SpeedStep®
  • Configuration*
    • From 0.844 V (600 MHz) to 1.48 V (1.7 GHz)
    • 100 μs delay
    • Voltage-frequency switching separation
  [Figure: transition sequence with labels: No Change, Freq. Transition, Volt. Transition, Volt. Transition, Freq. Transition]
  * Source: http://www.intel.com
Intel SpeedStep®
  • Configuration
    • Clock partitioning
      • Core clock
      • Bus clock (sequencer and interrupt interface)
    • Event blocking
      • Interrupts, pin events and snoop requests are not lost
Voltage Scheduling
  • Real-time problem will be discussed later
  • For non-real-time workloads, the goal is to improve energy efficiency
  • This is hard, because it is difficult to predict an arbitrary workload's future needs without deadline information
  • Instead, try to schedule processes and voltages to reduce idle time
    • e.g., Weiser et al, OSDI-1
Sleep Modes
ACPI: Advanced Configuration and Power Interface
  Developed by Microsoft, HP, Toshiba, Phoenix and Intel
Establishes interfaces for OS-directed power management
Replaces APM, MPS APIs and PnP BIOS
Defines
  Hardware registers
  BIOS interfaces
  System and device power states
Source: ACPI overview, http://www.acpi.info
DVS “Critical Power Slope”
It may be more efficient not to use DVS, and to run at the highest possible frequency, then go into a sleep mode!
  Depends on power dissipation in sleep mode
  And power dissipation at lowest voltage
This has been formalized as the critical power slope (Miyoshi et al, ICS’02):
  $m_{critical} = (P_{f_{min}} - P_{idle}) / f_{min}$
  If the actual slope $m = (P_f - P_{f_{min}}) / (f - f_{min}) < m_{critical}$, then it is more energy efficient to run at the highest frequency, then go to sleep
Switching overheads must be taken into account
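The critical power slope test can be written down directly. The sketch below is a minimal illustration with hypothetical function names and made-up power numbers; like the formula on the slide, it ignores switching overheads.

```python
def critical_slope(p_fmin, p_idle, f_min):
    """m_critical = (P_fmin - P_idle) / f_min."""
    return (p_fmin - p_idle) / f_min


def actual_slope(p_f, p_fmin, f, f_min):
    """m = (P_f - P_fmin) / (f - f_min)."""
    return (p_f - p_fmin) / (f - f_min)


def prefer_race_to_sleep(p_f, f, p_fmin, f_min, p_idle):
    """True when running at f and then sleeping beats running slowly at f_min."""
    return actual_slope(p_f, p_fmin, f, f_min) < critical_slope(p_fmin, p_idle, f_min)


# Illustrative (made-up) numbers: 3.0 W at 300 MHz, 0.3 W asleep.
print(prefer_race_to_sleep(p_f=5.0, f=600e6, p_fmin=3.0, f_min=300e6, p_idle=0.3))
print(prefer_race_to_sleep(p_f=7.0, f=600e6, p_fmin=3.0, f_min=300e6, p_idle=0.3))
```

With 5.0 W at 600 MHz, a job that takes time T at 300 MHz costs 3.0·T when run slowly, but 5.0·T/2 + 0.3·T/2 = 2.65·T when run fast and then put to sleep, so the first call favors sleeping. With 7.0 W the fast option costs 3.65·T and DVS wins.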
Multi Clock Domain Architecture
Multiple clock domains inside the processor
Globally-asynchronous locally synchronous (GALS) clock style
Independent voltage/frequency scaling
Synchronizers to ensure inter-domain communication
Multi Clock Domain Architecture
Advantages
  Local clock design is not aware of global skew
  Each domain limited by its local critical path, allowing higher frequencies
  Different voltage regulators allow for a finer-grain energy control
  Frequency/voltage of each domain can be tailored to its dynamic requirements
  Clock power is reduced
Drawbacks
  Complexity and penalty of synchronizers
  Feasibility of multiple voltage regulators
Multi Clock Domain ArchitectureMulti Clock Domain Architecture
Src runs with Src runs with CLKCLK11, dst , dst
with with CLKCLK22
Src writes at Src writes at TT11
If If TT > > TTss then dst can use then dst can use
the data at the data at TT22
If If TT < < TTss then dst can use then dst can use
the data at the data at TT33
T
CLK1
CLK2
1
2 3
4
Semeraro et al, ISCA 2003
SynchronizationSynchronization
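The synchronization rule can be sketched as a small function. It assumes that T is the gap between the source write and the next destination clock edge and that Ts is the synchronizer window; the function name and the handling of the T = Ts case (treated conservatively as a wait) are assumptions, since the slide does not spell them out.

```python
def dst_capture_time(t_write, dst_edges, t_sync):
    """Return the first dst clock edge at which data written at t_write is usable.

    dst_edges: ascending times of dst clock edges.
    t_sync:    synchronization window Ts.
    """
    later = [edge for edge in dst_edges if edge > t_write]
    if not later:
        raise ValueError("no dst edge after the write")
    if later[0] - t_write > t_sync:
        # T > Ts: the very next dst edge can safely latch the data.
        return later[0]
    if len(later) < 2:
        raise ValueError("need one more dst edge to wait a full cycle")
    # T <= Ts: too close to the edge, so wait one more dst cycle.
    return later[1]
```

For example, with dst edges at 1.0, 2.0 and 3.0 and Ts = 0.3, a write at 0.5 is usable at 1.0, while a write at 0.9 must wait until 2.0. That extra cycle is the synchronization penalty listed among the drawbacks of MCD designs.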
Multi Clock Domain Architecture
Domains must be carefully chosen
  Small cost on communications
  Re-using existing structures
Example
  5 domains
    • Front-end
    • Integer unit
    • FP unit
    • On-chip cache unit
    • Main memory
Multi Clock Domain Architecture
[Figure: block diagram of an MCD processor. Front-end domain: L1 i-cache, branch predict, fetch, IFQ, rename, dispatch. Integer domain: IIQ, int. register file, int. FUs. Floating Point domain: FIQ, fp. register file, fp. FUs. Memory domain: LSQ, L1 d-cache, L2 unified cache. Main Memory is shown outside the CPU.]
Magklis et al, ISCA 2003
Multi Clock Domain Architecture
Dynamic voltage/frequency scaling in each domain
Reconfiguration points must be chosen
  Off-line “shaker” algorithm
    • Aggressive oracle algorithm with good results
    • Uses detailed dynamic execution trace to find frequencies
    • It is not practical: requires future knowledge of this precise dynamic run
  On-line Attack-decay
    • Interval-based hardware algorithm
    • Transparent to the application, minimal overhead
    • More conservative, achieves 75% efficiency of off-line
  Profile-based
    • Use profiling to associate frequencies with parts of the code
    • When these points in the code are reached during a dynamic run then change frequencies
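The on-line attack-decay idea can be illustrated with a simplified per-interval update for one domain. This is a sketch under stated assumptions, not the published algorithm: it uses queue utilization as the only input, and the parameter names and default values (`deviation`, `attack`, `decay`) are placeholders chosen for illustration.

```python
def attack_decay_step(freq, util, prev_util, f_min, f_max,
                      deviation=0.02, attack=0.06, decay=0.002):
    """One interval of a simplified attack/decay frequency controller.

    util, prev_util: queue utilization of the domain in this and the last interval.
    """
    if prev_util > 0 and abs(util - prev_util) / prev_util > deviation:
        # Attack: utilization moved noticeably, so follow it with a large step.
        scale = 1 + attack if util > prev_util else 1 - attack
    else:
        # Decay: nothing changed much, so creep the frequency down to save energy.
        scale = 1 - decay
    return max(f_min, min(f_max, freq * scale))
```

Calling this once per interval for each clock domain needs only a utilization counter per domain, which is why such a scheme can be transparent to the application.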
Gating/Throttling
Gating: Disable some of the stages of the processor
  To reduce useless activity: after a branch misprediction
    Manne et al, ISCA 1998
  Effectiveness is heavily dependent on accuracy of branch confidence predictor
    Parikh et al, HPCA’02
Throttling: Slow down some processor stage when it is predicted that the performance will not be reduced
  Branch misprediction
  Long latency load miss
  IPC reduction in general
    Baniasadi and Moshovos, ISLPED 2001
Selective Throttling for Control Speculation
Control speculation increases power dissipation (28%)
  Energy wasted by mispredicted instructions
Selective throttling of fetch/decode
  Based on branch confidence
Gating of selection stage
  Instructions that likely belong to a mispredicted path
9% Energy-Delay improvement
Aragon et al, HPCA 2003
Co-Adaptive Instruction Fetch and Issue
Fetch gating based on issue queue utilization
  Rather than using instruction window usage
Fetch is stopped if close parallelism is present
  Just instructions from the head of the IQ are issued
  To match the size of the window residing in the IQ to application’s ILP
Fetch gating combined with dynamic issue queue adaptation
20% energy-delay improvement
Buyuktosunoglu et al, ISCA 2003
Compiler Techniques for Low Power
Good reference: tutorial by Kremer, PLDI’03
Traditional compiler optimizations often improve energy efficiency
  eg, register allocation, CSE, tiling for cache hit rate
But some compiler optimizations waste energy
  eg, aggressive speculation
Energy efficiency of code sequences is highly dependent on microarchitecture
  eg, free slot in a VLIW word
Compiler Techniques for Low Power, cont.
Compiler-guided DVS
  v1: reduce voltage while meeting real-time deadlines
  v2: reduce voltage in memory-bound program regions
    • Hsu and Kremer, ISLPED’01, PLDI’03
    • Xie et al, PLDI’03
Dynamic resource configuration/hibernation
  Deactivate modules when they won’t be used for a long time (>> sleep/wakeup time)
    • Heath et al, PACT’02
Profile/compiler-guided adaptation
  eg, profile-guided MCD adaptation mentioned earlier (Magklis et al, ISCA’03)
  eg, subroutine-guided (“positional”) adaptation (Huang et al, ISCA’03)
    • Uses a hierarchy of low-power modes
Much work in this area – this only touches the surface
Power Savings for Real Time Systems
Soft vs. hard real time
Periodic vs. aperiodic
  Periodic tasks are especially important in control systems
Most work has focused on DVS scheduling
Examples
  MPEG playback
  Web server
DVS for Multimedia Apps (soft real-time approach)
MM apps must process every frame within a time limit
  If there is idle time, then there is some slack
  IPC is constant across frames of the same type
Slow down the processor to meet deadlines
2 Phases
  Profiling
    • Determines the max. number of instructions that can be executed for each configuration
    • Sorts that list
  Adaptation
    • Predicts the number of instructions to be executed in the next interval
    • Uses the lowest energy hardware configuration that fulfills requirements
Hughes et al MICRO 2001
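The adaptation phase amounts to a table lookup over the profiled configurations. The sketch below is a simplified illustration, not the authors' implementation: the tuple layout, the configuration names, and the fall-back to the fastest configuration when nothing is feasible are all assumptions.

```python
def pick_configuration(configs, predicted_insts):
    """Choose the lowest-energy configuration that can retire the predicted work.

    configs: list of (name, max_insts_per_frame, energy_per_frame) tuples,
             as a profiling phase might produce them.
    """
    feasible = [c for c in configs if c[1] >= predicted_insts]
    if not feasible:
        # Nothing meets the deadline: fall back to the highest-throughput setting.
        return max(configs, key=lambda c: c[1])[0]
    return min(feasible, key=lambda c: c[2])[0]


# Made-up profile table for illustration only.
profile = [
    ("slow", 1_000_000, 1.0),
    ("mid", 2_000_000, 1.8),
    ("fast", 4_000_000, 3.5),
]
print(pick_configuration(profile, 1_500_000))
```

Because IPC is assumed constant across frames of the same type, the instruction-count prediction for the next frame is enough to index into the table.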
DVS for Multimedia Apps (hard real-time approach)
Buffering decoded frames provides a control point to enforce deadlines using feedback control
  Dead-zone proportional-integral controller sets DVS to maintain queue occupancy
  No profiling or other prior knowledge about stream is needed
  If queue becomes empty, “panic” mode forces highest speed
Lu et al ICCD 2003
[Figure: queue occupancy control; labels: dead zone, increase frequency, decrease frequency]
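A dead-zone proportional-integral controller on queue occupancy can be sketched as follows. This is a minimal illustration, not the controller from the cited paper: the class name, gains, target occupancy, and the positional form `f_base + kp*error + ki*integral` are assumptions.

```python
class DeadZonePI:
    """Dead-zone PI controller that sets decoder frequency from frame-queue occupancy."""

    def __init__(self, target, dead_zone, kp, ki, f_base, f_min, f_max):
        self.target = target          # desired number of buffered frames
        self.dead_zone = dead_zone    # errors this small are ignored
        self.kp, self.ki = kp, ki     # proportional and integral gains
        self.f_base, self.f_min, self.f_max = f_base, f_min, f_max
        self.integral = 0.0
        self.freq = f_base

    def update(self, occupancy):
        if occupancy == 0:
            # Panic: the queue is empty, so force the highest speed.
            self.freq = self.f_max
            return self.freq
        error = self.target - occupancy   # positive means the queue is draining
        if abs(error) > self.dead_zone:
            self.integral += error
            raw = self.f_base + self.kp * error + self.ki * self.integral
            self.freq = max(self.f_min, min(self.f_max, raw))
        # Inside the dead zone the previous frequency is held.
        return self.freq
```

The dead zone keeps the controller from changing voltage and frequency on every small fluctuation, which matters because each DVS transition has a cost.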
DVS for Web Servers
Basic idea: load balance, then do DVS to reclaim slack (Elnozahy et al, PACS’02)
  But it may be more profitable to cluster requests onto fewer nodes and put some to sleep
Even on single nodes, it may be profitable to briefly defer requests, then batch them at the highest frequency before going to sleep (Elnozahy et al, USITS’03)
Providing delay guarantees requires feedback control (Sharma et al RTSS 2001)
  A natural and effective control point is synthetic utilization
    • Combines true utilization with real-time schedulability
Other Approaches
Almost all RT algorithms attempt to reclaim slack
Episode detection (Flautner et al, MOBICOM’01)
  Identify interactive and periodic events, schedule accordingly
Program checkpoints – check performance relative to deadline and adjust DVS accordingly
Exploit direct knowledge of task execution times or utilization
VISA (Anantaraman et al, ISCA’03)
  Model a superscalar (unpredictable processor) as a predictable scalar processor to perform RT analysis and scheduling, then reduce DVS setting when superscalar processor runs faster than predicted
  Use program checkpoints to check progress/slack
Short-Circuit Power
Main solutions are
  Reduce rise/fall times
    • Tradeoff: reducing rise/fall times requires stronger drivers, more dynamic power
  Reduce capacitance being switched
Roadmap
Introduction & Trends
Dynamic Power Dissipation
  Sources, modeling, reduction techniques
Static Power Dissipation
  Sources, modeling, reduction techniques
Summary
Static Power Dissipation
Static power: dissipation due to leakage current
  Growing worse because Vth is not scaling as fast as Vdd
Roadmap
  Most important sources of static power: subthreshold leakage and gate leakage
  Inter-process variation
  Trends
  Modeling leakage power
  Circuit/architectural-level techniques
Static Power
Main mechanisms for leakage current
  Subthreshold (Berkeley predictive model):
    $I_{leakage} = \mu_0 \cdot C_{OX} \cdot \frac{W}{L} \cdot e^{b(V_{dd} - V_{dd0})} \cdot v_t^2 \cdot \left(1 - e^{-V_{dd}/v_t}\right) \cdot \exp\left(\frac{-|V_{th}| - V_{off}}{n \cdot v_t}\right)$
  Gate
    • $I_{gate} = I_{gate0} \cdot \exp(a(t_{ox} - t_{ox0})) \cdot \exp(b(V_{dd} - V_{dd0}))$
We will focus on subthreshold
  Gate leakage has essentially been ignored
    New gate insulation materials may solve problem, eg recent Intel announcement
    • R. Chau, Technology@intel Magazine. www.intel.com
  Gate-induced drain leakage (GIDL) occurs at negative gate voltages and high Vdd or high values of reverse body bias
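The two leakage expressions on this slide translate directly into code. The sketch assumes the BSIM3-style subthreshold form I = μ0 · Cox · (W/L) · exp(b(Vdd − Vdd0)) · vt² · (1 − exp(−Vdd/vt)) · exp((−|Vth| − Voff)/(n·vt)), with vt the thermal voltage kT/q. The function names are hypothetical, and any parameter values passed in would have to come from a technology model.

```python
import math

BOLTZMANN = 1.380649e-23           # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C


def thermal_voltage(temp_kelvin):
    """v_t = k*T/q, about 26 mV at 300 K."""
    return BOLTZMANN * temp_kelvin / ELECTRON_CHARGE


def subthreshold_leakage(mu0, c_ox, w_over_l, b, vdd, vdd0, v_th, v_off, n, temp_kelvin):
    """Subthreshold leakage current in the assumed BSIM3-style form."""
    v_t = thermal_voltage(temp_kelvin)
    return (mu0 * c_ox * w_over_l
            * math.exp(b * (vdd - vdd0))
            * v_t ** 2
            * (1.0 - math.exp(-vdd / v_t))
            * math.exp((-abs(v_th) - v_off) / (n * v_t)))


def gate_leakage(i_gate0, a, tox, tox0, b, vdd, vdd0):
    """I_gate = I_gate0 * exp(a*(tox - tox0)) * exp(b*(vdd - vdd0))."""
    return i_gate0 * math.exp(a * (tox - tox0)) * math.exp(b * (vdd - vdd0))
```

The exponential terms make the sensitivities explicit: raising Vth or lowering Vdd or temperature reduces subthreshold leakage sharply, which is what most of the circuit-level techniques exploit.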
Effects of Parameter Variations
Ioff depends exponentially on Vth
There is a large fluctuation of Ioff from die to die and from gate to gate
Controlling Vth is difficult in nanometer scale
  Drain-induced barrier lowering
    • Channel length is not constant
    • Exacerbated in sub-100nm devices
  Discrete dopant effects
    • In a very small channel, small number of dopants
    • Presence of these dopants and random fluctuation of their number lead to changes in Vth from device to device
Process variation affects
  Gate length (Ldrawn)
  Gate oxide thickness (Tox)
  Channel dose (Nsub)
Srivastava et al, ISLPED 2002
Static Power
Motivation
  Growing relative to dynamic power dissipation: soon 50% of total power
  Exponentially dependent on Temp, Vth, Vdd
  Natural target for optimization: idle transistors
[Figure: static power / dynamic power (percentage, 0 to 70) versus temperature (K) for the 180nm, 130nm, 100nm, 90nm, 80nm, and 70nm nodes; the ratio increases across generations]
Source: Skadron et al, University of Virginia
Static Power
Modeling Leakage
  Butts and Sohi (MICRO-33)
    • $P_{static} = V_{cc} \cdot N \cdot k_{design} \cdot \hat{I}_{leak}$
    • $\hat{I}_{leak}$ determined by circuit simulation, $k_{design}$ empirically
    • Key contribution: separate technology from design
  HotLeakage (UVA TR CS-2003-05, DATE’04)
    • Extension of Butts & Sohi approach: scalable with Vdd, Vth, Temp, and technology node; adds gate leakage
    • $\hat{I}_{leak}$ determined by BSIM3 subthreshold equation and BSIM4 gate-leakage equations, giving an analytical expression that accounts for dependence on factors that may change at runtime, namely Vdd, Vth, and Temp
    • $k_{design}$ replaced by separate factors for N- and P-type transistors
    • $k_{design}$ also exponentially dependent on Vdd and Tox, linearly dependent on Temp
    • Currently integrated with SimpleScalar/Wattch for caches
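The Butts and Sohi model is a simple product, which the following sketch makes concrete. The function name and all numeric values are illustrative assumptions, not data from the paper.

```python
def static_power(v_cc, n_transistors, k_design, i_leak):
    """P_static = V_cc * N * k_design * I_leak (Butts and Sohi form).

    i_leak is the per-device leakage current in amperes (a technology parameter);
    k_design is an empirical, design-dependent factor.
    """
    return v_cc * n_transistors * k_design * i_leak


# Made-up numbers: 1.0 V supply, one million devices, 1 nA leakage per device.
cache_like = static_power(1.0, 1_000_000, 1.2, 1e-9)
print(f"{cache_like * 1e3:.2f} mW")
```

Because the technology term and the design term are separate factors, the same design can be re-evaluated for another process node by swapping only the leakage current.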
Static Power
Modeling Leakage (cont.)
  Su et al, IBM (ISLPED’03)
    • Similar approach to HotLeakage – but they observe that modeling the change in leakage allows linearization of the equations
  Many other papers on various aspects of modeling leakage
    • Most focus on subthreshold
    • Few suggest how to model leakage in microarchitecture simulations
Circuit/architectural level techniques
Transistor sizing
Dual Vth
DVS
Dynamic threshold voltage – reverse body bias
Sleep transistors
Low leakage caches/branch predictors
Low leakage register file
Low leakage issue queue
Low leakage ALUs
Techniques for reducing gate leakage
What else?
Transistor sizing, Dual-Vth
Transistor sizing
  Reducing W/L reduces leakage: use smallest possible transistors
  Leakage-performance tradeoff
Dual-Vth
  High-threshold transistors dramatically reduce leakage: use low-Vth on critical paths, high-Vth elsewhere
  Often suggested in caches: many possible permutations
DVS
  Leakage is exponentially dependent on Vdd, so DVS reduces leakage
Dynamic Threshold Voltage
Adjust threshold voltage dynamically
  Also called reverse body bias (RBB), auto backgate-controlled multi-threshold CMOS (ABB-MTCMOS) (Nii et al, ISLPED’98)
  Apply negative voltage to body: requires larger VGS to establish channel, so it raises Vth
Engage RBB for idle transistors
  Preserves state
  Requires twin-well process; more expensive to manufacture
  Limited by GIDL
  Can also be used at testing to adjust circuit properties and reduce parameter variations
Sleep Transistors
Add a high-Vth transistor between the circuit and either/both power rails – the sleep transistor
  Also referred to as a “header” (to Vdd) or “footer” (to ground)
The high-Vth transistor cuts off most leakage
  In fact, a properly sized, lower-Vth footer transistor can preserve enough leakage to keep the cell active (Li et al, PACT’02; Agarwal et al, DAC’02)
Great care must be taken when switching back to full voltage: noise can flip bits
Extra latency may be necessary when re-activating
Low-Leakage Caches
Gated-Vdd/Vss (Powell et al, ISLPED’00; Kaxiras et al, ISCA-28)
  Uses sleep transistor on Vdd/ground for each cache line
  Typically considered non-state-preserving, but recent work (Agarwal et al, DAC’02) suggests that gated-Vss may preserve state
  Many algorithms for determining when to gate
  Simplest (Kaxiras et al, ISCA-28): Two-bit access counter and decay interval
  Adaptive decay intervals - hard
Drowsy cache (Flautner et al, ISCA-29)
  Uses dual supply voltages: normal Vdd and a low Vdd close to the threshold voltage
  State preserving, but requires an extra cycle to wake up – two extra cycles if tags are decayed
State preservation using leakage currents (Li et al, PACT’02; Agarwal et al, DAC’02)
  Similar to gated-Vss but designed to keep supply voltage high enough to preserve state (100-120 mV)
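The simplest gating policy, a two-bit access counter per line driven by a coarse global tick, can be sketched as below. This is an illustrative model only: the class name, the choice to gate on the tick that would overflow the counter, and the modeling of wake-up as an induced miss (non-state-preserving gating) are assumptions.

```python
class DecayLine:
    """One cache line with a two-bit decay counter (non-state-preserving gating)."""

    COUNTER_MAX = 3  # a two-bit counter saturates at 3

    def __init__(self):
        self.counter = 0
        self.powered = True

    def access(self):
        """Touch the line; returns True if this access is an induced miss."""
        induced_miss = not self.powered
        self.counter = 0       # any access restarts the decay interval
        self.powered = True    # a gated line is refetched and powered back on
        return induced_miss

    def global_tick(self):
        """Called once per fraction of the decay interval by a global counter."""
        if not self.powered:
            return
        if self.counter == self.COUNTER_MAX:
            self.powered = False   # idle for the whole interval: gate the line off
        else:
            self.counter += 1
```

Keeping only two bits per line and sharing one global tick counter keeps the bookkeeping overhead small relative to the leakage saved.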
Low Leakage Caches, cont.
Comparison (Parikh, Li, et al, WDDD’03, DATE’04)
  Compared non-state-preserving gated-Vss with state-preserving drowsy cache
  If gating is state-preserving, it wins because it essentially eliminates subthreshold and gate leakage
    • Unless wakeup time is significantly longer than with drowsy
  Otherwise, drowsy cache typically has an advantage because it is state preserving; no L2 accesses needed on “induced misses”
  But induced misses are rare, so for a reasonable range of on-chip L2 penalties (< 8 cycles in our studies), gating can still be superior
Low-Leakage Caches, cont: 4T Cells
4T-based branch predictors, caches
  Hu, Juang, et al, ISLPED’02, CA-Letters’02
  Non state-preserving
  Decay rate: temperature-dependent
    • Can be adjusted with passives
  Eliminates decay state bits
4 transistor cells [4T]
  Eliminates two transistors connected to Vdd
  Naturally decays over time
  Refreshes upon access
  When decayed, force default output
  Up to 33% smaller than equivalent 6T
  Decays quickly [8K cycles at 1 GHz]
  Leak only as much energy as is deposited
[Figure: 6T (left) and 4T (right) circuit diagrams]
Low-Leakage Caches, cont: Other Techniques
RBB (Nii et al, ISLPED’98)
  Back bias cache lines that are idle – can use the same decay counters as gated-Vdd/Vss
Leakage-biased bitlines (Heo et al, ISCA-29)
  Disable precharge and let the bitlines float: they will settle to a value that minimizes leakage
  Can only be applied to idle subbanks and requires accurate prediction of which subbank will be accessed
Huge variety of other techniques – this is only an overview of some of the major ones
Register Files
In general, state-preserving techniques for caches may work for register files too
Leakage-biased bitlines work here too
  Register file divided into subbanks
Alvandpour et al, Intel, ISLPED’01
  Uses dual Vth and a conditional keeper
    • “Keeper” used on dynamic circuits to counteract voltage droop due to leakage – they constitute a static pull-up path
    • Dynamic circuits arise in the muxes due to multiporting
    • “Conditional” keeper technique uses two cascaded keepers; one is fixed and the other only engaged when needed to drive an output – requires careful timing analysis
  Access transistors and keepers are high-Vt/
ALUs
  Usually dual-VT domino logic
    Area & speed
  Sleep transistors can be used, but at a cost
    Dynamic nodes are discharged
    Worth using only when the savings outweigh that cost
  Dropsho et al, MICRO 2002
Other Techniques
  Queues (eg, issue queues)
    Various occupancy-based or rate-matching techniques have been proposed for issue queue resizing
    Deactivating queue entries reduces leakage
    eg, Ponomarev et al, MICRO-34
  Compiler techniques
    When compiler knows that regions are idle, they can be deactivated
    eg, Zhang et al, MICRO-35
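The occupancy-based resizing idea can be sketched as follows. This is a minimal illustration, not the algorithm of any cited paper: the chunk size, the sampling scheme, the `full_threshold` value, and the assumption that a deactivated entry leaks nothing are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ResizableQueue:
    """Issue queue whose active size changes in fixed-size chunks (hypothetical model)."""
    total_entries: int = 64
    chunk: int = 8
    active: int = 64
    full_threshold: float = 0.1  # hypothetical: grow if full in >10% of samples

    def resize(self, occupancy_samples):
        """Pick the active size for the next period from this period's occupancy samples."""
        if not occupancy_samples:
            return self.active
        n = len(occupancy_samples)
        avg = sum(occupancy_samples) / n
        full_frac = sum(1 for o in occupancy_samples if o >= self.active) / n
        if full_frac > self.full_threshold and self.active < self.total_entries:
            # Queue was frequently full: performance is at risk, so grow.
            self.active += self.chunk
        elif avg < self.active - self.chunk and self.active > self.chunk:
            # At least one whole chunk sat idle on average: deactivate it.
            self.active -= self.chunk
        return self.active

    def leakage_fraction(self):
        """Leakage relative to the full queue, assuming idle chunks leak nothing."""
        return self.active / self.total_entries


q = ResizableQueue()
for _ in range(8):
    q.resize([10] * 100)  # a phase that keeps only about 10 entries busy
print(q.active, q.leakage_fraction())
```

Growing on a "frequently full" signal rather than on average occupancy is a design choice in this sketch: it reacts quickly when the smaller queue starts to limit performance.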
Gate Leakage
  Any technique that reduces Vdd
  Otherwise it seems difficult to develop architecture techniques that directly attack gate leakage
    In fact, very little work has been done in this area
  One example: domino gates (Hamzaoglu & Stan, ISLPED’02)
    Replace traditional NMOS pull-down network with a PMOS pull-up network
    Gate leakage is greater in NMOS than PMOS
    But PMOS domino gate is slower
Roadmap
  Introduction & Trends
  Dynamic Power Dissipation
    Sources, modeling, reduction techniques
  Static Power Dissipation
    Sources, modeling, reduction techniques
  Summary
Other Power-Related Issues
  Thermal
    Managing on-chip temperatures (as opposed to average heat dissipation) is not just a matter of reducing average power density
    Spatial and temporal variation
      • Spatial: hot spots; must reduce power density in the right places
      • Temporal: must reduce power when chip is hot
    This is often when there is less slack
    Must model temperature directly
      • Average power metrics do not accurately predict temperature (Skadron et al, ISCA’03)
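To illustrate why average power is a poor proxy for temperature, here is a minimal sketch using a single lumped thermal RC node. The thermal resistance, capacitance, ambient temperature, and both power traces are hypothetical values chosen for illustration; this is not the model from the cited paper. The two traces have the same average power but reach very different peak temperatures.

```python
def simulate_temperature(power_trace, r_th=1.0, c_th=0.05, t_amb=45.0, dt=0.001):
    """Forward-Euler integration of C*dT/dt = P - (T - T_amb)/R for one thermal node.

    r_th in K/W, c_th in J/K, dt in seconds; all values are assumptions.
    """
    temp = t_amb
    temps = []
    for p in power_trace:
        temp += dt * (p * r_th - (temp - t_amb)) / (r_th * c_th)
        temps.append(temp)
    return temps


# Two hypothetical 2-second traces (1 ms per sample), both averaging 30 W.
steady = [30.0] * 2000
bursty = ([10.0] * 500 + [50.0] * 500) * 2

steady_temps = simulate_temperature(steady)
bursty_temps = simulate_temperature(bursty)

print("average power:", sum(steady) / len(steady), sum(bursty) / len(bursty))
print("peak temperature:", max(steady_temps), max(bursty_temps))
```

The bursty trace peaks roughly 20 K above the steady one in this setup, which is the temporal effect noted above; a spatial version would need one node per block plus lateral resistances.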
Other Power-Related Issues
  Voltage stability (dI/dt)
    Inductance means that abrupt changes in current can cause voltage droop
    This can be addressed with decoupling capacitance, but required capacitance is becoming expensive
    Grochowski et al, HPCA’02, Joseph et al, HPCA’03
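A back-of-the-envelope sketch of the dI/dt problem, using the inductor relation V = L * dI/dt and a crude first-order decoupling estimate C = dI * dt / dV (it assumes the capacitor alone supplies the current step for the whole interval). The inductance, current step, ramp time, and droop budget below are hypothetical numbers, not measurements from the cited papers.

```python
def inductive_droop(l_henry, delta_i_amps, delta_t_s):
    """Voltage droop across the supply inductance for a linear current ramp."""
    return l_henry * delta_i_amps / delta_t_s


def decap_needed(delta_i_amps, delta_t_s, allowed_droop_v):
    """Crude estimate: capacitance that supplies delta_i for delta_t within the droop budget."""
    return delta_i_amps * delta_t_s / allowed_droop_v


# Hypothetical scenario: 10 pH effective inductance, 20 A step over 2 ns,
# and a 50 mV droop budget.
droop = inductive_droop(10e-12, 20.0, 2e-9)
cap = decap_needed(20.0, 2e-9, 0.05)
print(f"droop = {droop * 1e3:.0f} mV, decoupling capacitance = {cap * 1e9:.0f} nF")
```

Halving the ramp time doubles the droop in this model, which is why architectural techniques that smooth abrupt current changes are attractive alongside added capacitance.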
Roadmap
  Introduction & Trends
  Dynamic Power Dissipation
    Sources, modeling, reduction techniques
  Static Power Dissipation
    Sources, modeling, reduction techniques
  Summary
Summary
  Power dissipation is becoming a huge concern
    Total power budget
    Power density (thermal)
    Energy consumption & battery life
  Power dissipation
    Switching
    Short-circuit
    Leakage
  Power modeling crucial
    Academia: accurate research
    Industry: detect hot spots on time to meet POR
Summary
  Reducing dynamic power
    Circuits perspective
      • Energy-effective access (reducing capacitance or driving voltage)
      • Gating
    Architectural perspective
      • Decreasing activity factor
      • Pipeline gating
      • Adjusting voltage/frequency to meet application requirements
  Reducing static power
      • Dual Vth
      • Non-state-preserving vs. state-preserving techniques
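The dynamic-power levers listed above (activity factor, capacitance, voltage, frequency) all appear in the standard first-order switching model P = a * C * Vdd^2 * f. The sketch below, with hypothetical parameter values, shows the effect of scaling voltage and frequency together; it ignores short-circuit and leakage power.

```python
def dynamic_power(activity, cap_farads, vdd, freq_hz):
    """First-order switching power: P = a * C * Vdd^2 * f."""
    return activity * cap_farads * vdd ** 2 * freq_hz


def task_energy(activity, cap_farads, vdd, freq_hz, cycles):
    """Switching energy for a fixed cycle count: power times run time (cycles / f)."""
    return dynamic_power(activity, cap_farads, vdd, freq_hz) * cycles / freq_hz


# Hypothetical operating point, then voltage and frequency both scaled to 80%.
base = dict(activity=0.2, cap_farads=50e-9, vdd=1.2, freq_hz=3.0e9)
scaled = dict(base, vdd=base["vdd"] * 0.8, freq_hz=base["freq_hz"] * 0.8)

power_ratio = dynamic_power(**scaled) / dynamic_power(**base)
energy_ratio = task_energy(cycles=1e9, **scaled) / task_energy(cycles=1e9, **base)
print(f"power ratio = {power_ratio:.3f}, energy ratio = {energy_ratio:.3f}")
```

Power falls with the cube of the scaling factor, while energy per task falls only with its square, because the task takes longer at the lower frequency.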