© 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A...


Page 1: Power-Aware Design for High-Performance Processors

A Tutorial at HPCA-2004

Kevin Skadron, University of Virginia
Jose Gonzalez, Intel Labs Barcelona

© 2004, Kevin Skadron and Jose Gonzalez

Page 2: Roadmap

• Introduction & Trends
• Dynamic Power Dissipation
    • Sources, modeling, reduction techniques
• Static Power Dissipation
    • Sources, modeling, reduction techniques
• Summary

Page 3: Introduction

• Power: work done per unit time (watts)
• Energy: total work (joules)
• Why is power a concern in current processors?
    • Increased market demand for consumer electronics powered by batteries; battery life is a selling point
    • Electricity and cooling costs for large data centers are becoming substantial
        • 5-25% of data center income (cf. Rajamony & Bianchini tutorial, ICS'02)
    • Government energy-efficiency requirements
        • (e.g. Energy Star in the US)
    • Electricity costs for large ISPs are becoming substantial
    • Packaging and cooling costs (due to the increase in power density) are becoming prohibitive
    • Power dissipation may reach technology limits
    • Current delivery is becoming expensive

Page 4: Metrics

Some different power metrics & fallacies:

• Reducing power does not always save energy
    • Energy = ∫ P dt
    • If you reduce power but increase execution time, energy may go up
• Also note that reducing power does not always reduce temperature
• Sustained power density limits thermal design/packaging
    • Approx. same as thermal design power
    • Note that on-chip temperatures and total heat production are somewhat different concerns
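The first fallacy above can be checked with a few lines of arithmetic. The sketch below integrates a sampled power trace with the trapezoidal rule; the two traces (50 W for 10 s versus 40 W for 14 s) are illustrative numbers, not figures from the tutorial.

```python
def energy_from_trace(times_s, power_w):
    """Approximate E = integral of P dt with the trapezoidal rule."""
    total_j = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        total_j += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total_j


# Hypothetical runs: the second draws 20% less power but takes 40% longer.
baseline_j = energy_from_trace([0.0, 10.0], [50.0, 50.0])
slowed_j = energy_from_trace([0.0, 14.0], [40.0, 40.0])

print(f"baseline: {baseline_j:.0f} J, reduced-power run: {slowed_j:.0f} J")
```

Under these assumed numbers the lower-power run consumes more energy, because the longer execution time outweighs the power saving.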

Page 5: Metrics

• Power
    • Average power
    • Power density map
• Energy
    • Energy (MIPS/W)
    • Energy-Delay product (MIPS²/W)
    • Energy-Delay² product (MIPS³/W): voltage independent! (Zyuban, GVLSI'02)
• Temperature
    • Average temperature
    • Peak temperature
    • Temperature map
        • Does not necessarily match power density map
    • No good figures of merit for trading off thermal efficiency against performance, area, or energy efficiency
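A rough way to see why the Energy-Delay² product is called voltage independent is the first-order sketch below. It assumes that energy per operation scales as $C V_{dd}^2$ and that delay scales as $1/V_{dd}$ (supply well above threshold); both are simplifying assumptions rather than statements from the slide.

```latex
\begin{align*}
E &\propto C\,V_{dd}^{2}, \qquad D \propto \frac{1}{V_{dd}} \\
E \cdot D^{2} &\propto C\,V_{dd}^{2} \cdot \frac{1}{V_{dd}^{2}} = C \\
\frac{\mathrm{MIPS}^{3}}{\mathrm{W}} &\propto \frac{(1/D)^{3}}{E/D} = \frac{1}{E \cdot D^{2}}
\end{align*}
```

The last line uses MIPS ∝ 1/D and W ∝ E/D, which is why MIPS³/W is the reciprocal form of the same metric.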

Page 6: Power Dissipation

• Dynamic power dissipation
    • Due to switching activity
• Static power dissipation
    • Due to leakage current; major paths are:
        • Subthreshold leakage: exponentially dependent on Vdd, Vth, Temp
        • Gate leakage: exponentially dependent on Vdd, Tox

Page 7: Power Dissipation

Total power actually consists of
• Switching power
• Short-circuit power
• Leakage power

Page 8: Big Picture - Trends

• Data on current power dissipation for various chips
• Distribution of power within a typical processor
• Scaling trends in power dissipation
• Trends in leakage power
• Trends in battery life

Page 9: Power Dissipation

Processor          Clock Rate   Power (Max)
Alpha 21364        1.15 GHz     110 W
AMD Opteron        2.2 GHz      86 W
HP-PA8700          870 MHz      75 W
IBM-Power 4        1.7 GHz      100 W
Intel Itanium 2    1.5 GHz      130 W
Intel Xeon         3.2 GHz      86 W
MIPS R14000        600 MHz      16 W

Source: Microprocessor Report

Page 10: Power Dissipation Breakdown

[Figure: power breakdown for the Alpha 21264 by category: global clock network, instruction issue units, caches, FP execution units, int. execution units, mem. management unit, I/O, miscellaneous]

Source: Gowan et al., "Power Considerations in the Design of the Alpha 21264 Microprocessor", DAC 1998

Page 11: Effects of Technology Scaling on Power Dissipation

• Feature size is scaling down: 30%
• Frequency is increasing: ~2x (ideal scaling: ~1.4x, as gate delay decreases by 30%)
• Area increases due to microarchitecture improvements: 25% (ideal scaling: decreases by 50%)
• Active capacitance increases: at least 30% (ideal scaling: decreases by 30%)
• Vdd is not scaled down at the same rate as feature size: 0-10% (ideal scaling: 30%)
• Ideal scaling: P ∝ C V² f -> 0.7² reduction ≈ 0.5
• Observed scaling -> 2 – 2.5x increase
• Power density becomes a problem!
    • Especially since the power density is non-uniform
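The ideal and observed figures above follow from P ∝ C V² f. The sketch below plugs in the per-generation factors listed on the slide. Treating "at least 30%" as exactly 1.3x and "0-10%" as Vdd factors of 1.0 and 0.9 is an assumption made for illustration.

```python
def power_ratio(cap_ratio, vdd_ratio, freq_ratio):
    """Relative change in dynamic power, from P proportional to C * V^2 * f."""
    return cap_ratio * vdd_ratio ** 2 * freq_ratio


s = 0.7  # ideal per-generation scaling factor (30% shrink)

# Ideal scaling: C and Vdd shrink by s, frequency rises by 1/s.
ideal = power_ratio(cap_ratio=s, vdd_ratio=s, freq_ratio=1 / s)

# Observed trends (assumed point values): C up 30%, Vdd down 0-10%, f up 2x.
observed_low = power_ratio(cap_ratio=1.3, vdd_ratio=0.9, freq_ratio=2.0)
observed_high = power_ratio(cap_ratio=1.3, vdd_ratio=1.0, freq_ratio=2.0)

print(f"ideal: {ideal:.2f}x, observed: {observed_low:.2f}x to {observed_high:.2f}x")
```

With these inputs the ideal case roughly halves power per generation, while the observed factors multiply it by a little over two, in line with the range quoted on the slide.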

Page 12: Power Evolution

[Figure: max power (Watts), log scale from 1 to 100, for i386, i486, Pentium®, Pentium® w/MMX tech., Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4, followed by "?"]

Source: Intel

Page 13: Trends in Power Density

[Figure: power density (Watts/cm²), log scale from 1 to 1000, for i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4, with reference levels for a hot plate, a nuclear reactor, and a rocket nozzle]

* "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies", Fred Pollack, Intel Corp., Micro32 conference keynote, 1999.

Page 14: ITRS Projections

Year                            2003   2006   2010   2013   2016
Tech node (nm)                  100    70     45     32     22
Vdd (high perf) (V)             1.0    0.9    0.6    0.5    0.4
Vdd (low power) (V)             1.1    1.0    0.8    0.7    0.6
Frequency (high perf) (GHz)     3.1    5.6    11.5   19.3   28.8
Max power (W)
    High-perf w/ heatsink       160    180    218    251    288
    Cost-performance            85     98     120    138    158
    Hand-held                   3.2    3.5    3.0    3.0    3.0

Source: ITRS 2001

• These are targets
• Based on historical trends, the high-performance power targets seem optimistic
• Intel papers suggest that in the 45-75W range, cooling costs $1/W; but then rate of increase goes up: $2, $3/W, maybe more! (Borkar, IEEE Micro '99; Gunther et al, ITJ '01)

Page 15: Leakage Power

• The fraction of leakage power is increasing exponentially with each generation
• Also exponentially dependent on temperature

[Figure: static power / dynamic power (percentage, 0 to 70) versus temperature (K), with curves for 180nm, 130nm, 100nm, 90nm, 80nm, and 70nm; the ratio increases across generations]

Source: Skadron et al, University of Virginia

Page 16: Trends in Battery Technology

• Battery lifetime is increasing perhaps 8-10%/yr. (Powers, Proc. of IEEE 1995)
• Not keeping up with rate of growth in energy consumption

Source: Rabaey 1995, cited in Irwin et al, "Low Power Design Methodologies, Hardware and Software Issues", tutorial at PACT 2000

Page 17: Roadmap

• Introduction & Trends
• Dynamic Power Dissipation
    • Sources, modeling, reduction techniques
• Static Power Dissipation
    • Sources, modeling, reduction techniques
• Summary

Page 18: Dynamic Power Dissipation

Roadmap
• Sources of dynamic power dissipation
• Modeling dynamic power
• Circuit- and architecture-domain techniques to reduce power

Page 19: Dynamic Power Consumption

• Power dissipated due to switching activity
• A capacitance is charged and discharged

[Figure: load capacitance CL charged from Vdd on a 0->1 transition and discharged on a 1->0 transition]

• Ec = 1/2 CL V² (charging)
• Ed = 1/2 CL V² (discharging)
• Charge/discharge at the frequency f: P = CL V² f
• Note that energy consumed from battery is CL V² and is drawn upon charging
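The two half-energy terms above can be recovered from a short derivation. This is a standard textbook sketch, assuming the load is charged fully from 0 to $V_{dd}$ and then fully discharged; the symbols $i(t)$ and $v(t)$ are introduced here for illustration.

```latex
\begin{align*}
E_{\text{supply}} &= \int V_{dd}\, i(t)\, dt
                   = V_{dd} \int_{0}^{V_{dd}} C_L \, dv
                   = C_L V_{dd}^{2} \\
E_{\text{stored}} &= \int_{0}^{V_{dd}} C_L\, v \, dv = \tfrac{1}{2} C_L V_{dd}^{2} \\
E_c &= E_{\text{supply}} - E_{\text{stored}} = \tfrac{1}{2} C_L V_{dd}^{2}
      \quad \text{(dissipated while charging)} \\
E_d &= E_{\text{stored}} = \tfrac{1}{2} C_L V_{dd}^{2}
      \quad \text{(dissipated while discharging)} \\
P &= (E_c + E_d)\, f = C_L V_{dd}^{2} f
\end{align*}
```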

Page 20: Dynamic Power Dissipation

Equation: P = a CL Vdd² f

• a: activity factor
    • Depends on the processor architecture
• CL: capacitance of the circuit
    • Depends on the design style, number of transistors, transistor sizing, etc.
• Vdd: operating voltage
• f: frequency
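As a quick illustration of the equation, the sketch below evaluates P = a CL Vdd² f for one operating point, then again with voltage and frequency both halved. The capacitance, voltage, and frequency values are hypothetical, chosen only to give round numbers.

```python
def dynamic_power_watts(activity, cap_farads, vdd_volts, freq_hz):
    """Dynamic power P = a * C_L * Vdd^2 * f."""
    return activity * cap_farads * vdd_volts ** 2 * freq_hz


# Hypothetical operating point: a = 0.5, C_L = 20 nF, Vdd = 1.2 V, f = 2 GHz.
nominal_w = dynamic_power_watts(0.5, 20e-9, 1.2, 2e9)

# Halving both voltage and frequency: power falls by 2^2 * 2 = 8x.
scaled_w = dynamic_power_watts(0.5, 20e-9, 0.6, 1e9)

print(f"nominal: {nominal_w:.1f} W, scaled: {scaled_w:.1f} W")
```

The quadratic dependence on Vdd is why lowering voltage together with frequency is so much more effective than lowering frequency alone.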

Page 21: Dynamic Power Modelling

• P = a CL V² f
• Information needed
    • Activity counters in each unit
    • Energy dissipated per access
• For precision, "a" (# of signal transitions) should be measured or at least estimated with a probabilistic model
• More commonly, a = 0.5 is assumed

[Figure: block diagram with the labels Configuration, Performance Model, Activity, Power Model, Performance metrics, Power metrics]

Page 22: Dynamic Power Modelling

• Activity counters
    • Performance model is used
    • Counters for: cache access, FU usage, register file, ...
• Energy per access
    • Analytically: calculating capacitances as a function of size, ports, etc.
        • Example: cache access: decoder, precharge transistors, bitline, cell access, wordline, sense amplifiers ...
        • Wattch (Brooks et al, ISCA 2000)
        • Cacti
    • Empirically: using low level designs and applying "virus" tests
        • Virus test: microbenchmark that stresses a particular unit
        • ALPS (Gunther et al, ITJ, 2001)
• Circuit-extracted model
    • PowerTimer: IBM Power4 (Brooks et al, PACS'00)
    • AccuPower: parameterized, based on SPICE measurements of actual layouts (SUNY Binghamton, Ponomarev et al, DATE'02)
    • PowerAnalyzer: StrongARM (Michigan, assoc. w/ SimpleScalar)
• Many of these ignore the actual number of signal transitions
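The activity-counter approach can be sketched in a few lines: a performance model supplies per-unit access counts, and a table supplies energy per access. The unit names, energy values, and counts below are hypothetical placeholders, not numbers from Wattch, Cacti, or any other tool named above.

```python
# Hypothetical energy per access for each unit, in nanojoules.
ENERGY_PER_ACCESS_NJ = {
    "icache": 0.5,
    "dcache": 0.6,
    "regfile": 0.2,
    "int_alu": 0.3,
}


def average_power_watts(access_counts, cycles, clock_hz,
                        energy_nj=ENERGY_PER_ACCESS_NJ):
    """Average power = sum(count * energy per access) / execution time."""
    total_j = sum(count * energy_nj[unit] * 1e-9
                  for unit, count in access_counts.items())
    exec_time_s = cycles / clock_hz
    return total_j / exec_time_s


# Counts as a performance simulator might report them (made-up values).
counts = {"icache": 1_000_000, "dcache": 400_000,
          "regfile": 2_000_000, "int_alu": 800_000}
power_w = average_power_watts(counts, cycles=1_000_000, clock_hz=1e9)

print(f"average power: {power_w:.2f} W")
```

Note that this kind of model charges a fixed energy per access, so it shares the limitation in the last bullet: it does not see how many signals actually toggle.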

Page 23: Circuit-Level Techniques

• Transistor sizing
• Signal and clock gating
• Circuit restructuring
• Low power caches
• Low power register files
• Issue queue

These typically reduce the capacitance being switched

Page 24: Transistor Sizing

• Transistor sizing plays an important role in reducing power

[Figure: chain of stages with capacitances C0, C1, ..., CN-1, CN, where K = Ci/Ci-1]

• Delay ~ K / ln K
• Power ~ K / (K-1)
• Optimum K for both power and delay must be pursued
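The tension between the two expressions can be seen by sweeping the stage ratio K. The sketch below uses the slide's proportionalities directly. The sweep range and step are arbitrary choices, and the constants of proportionality are ignored.

```python
import math


def relative_delay(k):
    """Delay proportional to K / ln K for stage ratio K > 1."""
    return k / math.log(k)


def relative_power(k):
    """Power proportional to K / (K - 1) for stage ratio K > 1."""
    return k / (k - 1)


# Sweep K from 1.5 to 8.0 in steps of 0.01 (arbitrary illustration range).
candidates = [1.5 + 0.01 * i for i in range(651)]
best_delay_k = min(candidates, key=relative_delay)

print(f"delay-optimal K is about {best_delay_k:.2f} (e = {math.e:.3f})")
for k in (2.0, math.e, 4.0, 8.0):
    print(f"K={k:.2f}  delay~{relative_delay(k):.2f}  power~{relative_power(k):.2f}")
```

Under these expressions delay is minimized near K = e, while power keeps falling as K grows, so a ratio somewhat above e trades a small delay penalty for lower power.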

Page 25: Signal Gating

"Techniques to mask unwanted switching activities from propagating forward, causing unnecessary power dissipation"

• Implementation
    • Simple gate
    • Tristate buffer
    • ...
• Control signal needed
    • Generation requires additional logic
• Identification of signals to be gated
    • Clock
    • Address bus
• Also helps to prevent power dissipation due to glitches

[Figure: gate with inputs "signal" and "ctrl" producing "Output"]

Page 26: Clock Gating

"Disabling a functional block when it is not required for an extended period"

• Implementation
    • Simple gate that replaces one buffer in the clock tree
    • Delay is generally not a concern
• Decision
    • Architectural level

[Figure: "signal" and "ctrl" inputs gated ahead of a functional unit]

Page 27: Circuit Restructuring

• Pipeline (can reduce frequency)
• Parallelize (can reduce frequency)
• Reorder inputs so that most active input is closest to output (reduces switched capacitance)
• Restructure gates (equivalent functions are not equivalent in switched capacitance)
• Energy-efficient flip-flops and latches

Page 28: Cache Design

• C_access = R × C × C_cell
• Reducing power
    • Switched capacitance
    • Voltage swing
    • Activity factor
    • Frequency

[Figure: cache array of R rows by C cols with row decoder, wordline, bitlines, column decoder, and sense amp]

[Figure: bar chart (axis 0 to 80) with Read and Write series for Decoder, Wlines, TBLSA, DBLSA, I/O buses, Other. TBLSA: tag bitlines & sense amp. DBLSA: data bitlines and sense amp. Cache parameters: 16 KB cache, 0.25 μm]

Villa et al, MICRO 2000

Page 29: Cache Design

• Banked organization
    • Targets switched capacitance
    • C_access = R × C × C_cell / B
• Dividing word line
    • Same effect for wordlines
• Reducing voltage swings
    • Sense amplifiers used to detect Vdiff across bitlines
    • Read operation can be curtailed as soon as Vdiff is detected
    • Limiting voltage swing saves a fraction of power
• Pulse word lines
    • Enabling the word line for the time needed to discharge bitcell voltage
    • Designer needs to estimate access time and implement a pulse generator
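The banking formula can be illustrated numerically. The sketch below assumes a 16 KB array laid out as 512 rows by 256 columns and a per-cell capacitance of 2 fF. Both the layout and the cell capacitance are hypothetical values chosen for the example.

```python
def access_capacitance_farads(rows, cols, cell_cap_farads, banks=1):
    """Switched capacitance per access: C_access = R * C * C_cell / B."""
    return rows * cols * cell_cap_farads / banks


# Hypothetical 16 KB array: 512 rows x 256 columns = 131072 bits, 2 fF per cell.
ROWS, COLS, CELL_CAP = 512, 256, 2e-15

monolithic = access_capacitance_farads(ROWS, COLS, CELL_CAP)
four_banks = access_capacitance_farads(ROWS, COLS, CELL_CAP, banks=4)

print(f"1 bank: {monolithic * 1e12:.1f} pF, 4 banks: {four_banks * 1e12:.1f} pF")
```

In this simple model only the accessed bank switches, so capacitance per access falls in proportion to the number of banks; bank-selection logic and routing overheads are not modelled.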

Page 30: Low Power Register File Design

• RFs usually have single-ended bitlines
• Modified storage cell
    • Lots of zeros fetched from the RF
    • Bitline connections are modified to eliminate bitline discharge when reading a zero

Tseng and Asanovic, ICSD, 2000; Zyuban and Kogge, ISLPED 1998

Page 31: Efficient Issue Queue

• Constitutes a high fraction of the overall power
    • >25% according to some authors

[Figure: issue queue entry with tag buses Tag 1 ... Tag w, comparators (comp) feeding OR gates, and RDY and Oprnd fields for each of two operands]

Page 32: Efficient Issue Queue

• Useful comparison
    • Empty entries and ready entries consume energy
        • Wakeup of empty entries can be disabled: gating off precharge logic using valid bit
        • Wakeup of ready sources can be disabled: gating off precharge logic using ready bit
    • Folegnani and Gonzalez, ISCA 2001
• Energy-efficient comparators
    • Traditional comparators dissipate energy on a mismatch in any bit position
    • 10%-20% of source operands match each cycle
    • Solution: comparators that dissipate energy on a match
    • Kucuk et al, ISLPED 2001

Page 33: Architectural-Level Techniques

• Encoding/compression
• Energy-efficient front end
• Energy-efficient caches
• Asymmetric processors
• Dynamic Voltage/Frequency scaling
• Multi clock domain architectures (similar to GALS)
• Pipeline gating
• Compiler techniques
• Sleep modes

These typically take advantage of locality or slack

Page 34: Bus Invert Encoding

• Reduce power of parallel synchronous signals
• Idea: minimize the number of transitions
    • (Stan & Burleson, IEEE Trans. on VLSI, 1995)
• Sender examines the current and the next values
• Decides whether to send the true or the complement signal
• Additional polarity signal is sent along with data
• Example

    Current data            110011101
    Next data               000100110
    Number of transitions   7

    Current data            110011101
    NOT (Next data)         111011001
    Number of transitions   2
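The sender's decision can be sketched as follows. This is a minimal illustration, assuming the sender inverts whenever more than half of the data lines would toggle; transitions on the polarity line itself are ignored, and the function names are made up for this example.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert(current_bus, next_data, width):
    """Return (value to drive on the bus, polarity bit).

    Polarity 1 means the receiver must complement the value it sees.
    """
    mask = (1 << width) - 1
    if hamming(current_bus, next_data) > width // 2:
        return ~next_data & mask, 1
    return next_data, 0


# The 9-bit example from the slide.
current = 0b110011101
nxt = 0b000100110
sent, polarity = bus_invert(current, nxt, width=9)

print(f"plain transitions:    {hamming(current, nxt)}")
print(f"sent {sent:09b} with polarity {polarity}")
print(f"encoded transitions:  {hamming(current, sent)}")
```

The receiver recovers the data by complementing the bus value whenever the polarity bit is 1.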

Page 35: Dynamic Zero Compression

• Zero Indicator Bit (ZIB) added to each byte
    • Enabled if a zero is stored in cache
    • On a read access, bitline discharge is prevented by disabling local wordline
    • On a write, if the byte is zero, just ZIB is written
• Circuit modifications
    • Zero-detection and store bus drivers
    • Wordline gating: 8-bit data is driven by the associated ZIB
    • Sense amps: modified to drive a zero if ZIB active
• Drawbacks
    • 9% area increase, 2-gate delay increase
• Results
    • 26% energy reduction data cache, 10% instruction cache

Villa et al, MICRO 2000

Page 36: Exploiting Narrow Width Operands

• High percentage of integer operations require <16 bits
    • Difficult for the compiler to know the actual operand size
    • Variability for the same instruction in successive instances
• Clock gating is used to partially disable the FU

[Figure: datapath with the labels Operand A, Operand B, High latch, Low latch, clk, AND, zero48, Integer FU, Zero detect, Result; 64-bit buses with bit ranges 0-15 and 16-63]

Brooks and Martonosi, HPCA 1999
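The zero48 signal in the figure can be modelled in software to show which operand latches would be clocked. This is a behavioural sketch under the assumption that an operand counts as narrow when bits 16-63 are all zero; the function names and the returned latch count are illustrative and do not come from the paper.

```python
MASK64 = (1 << 64) - 1


def zero48(value):
    """True when bits 16-63 of a 64-bit value are all zero."""
    return ((value & MASK64) >> 16) == 0


def gated_add(a, b):
    """64-bit add that also reports how many high latches had to be clocked.

    Returns (result, zero48 flag of the result, high latches clocked).
    A high latch is clocked only when its operand is not narrow.
    """
    high_latches_clocked = int(not zero48(a)) + int(not zero48(b))
    result = (a + b) & MASK64
    return result, zero48(result), high_latches_clocked


print(gated_add(3, 5))        # both operands narrow
print(gated_add(1 << 20, 1))  # one wide operand
```

Because the flag is computed from the result at run time, it captures the per-instance variability that the compiler cannot see.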

Page 37: Energy-Efficient Front End: Branch Prediction

• Branch prediction
    • Parikh et al, HPCA'02, IEEE Trans. Computers '04
    • Branch prediction accuracy is a major determinant of pipeline activity -> spending more power in the branch predictor can be worthwhile if it improves accuracy
    • Branch predictors can be designed to reduce power, e.g.
        • Banking
        • Gate off unnecessary accesses ("prediction probe detector")

Page 38: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Energy-Efficient Front End: Register Renaming

• RAT often implemented as a multiported register file indexed by logical register; returns physical register
  • Liu and Lu, MICRO'00
    • Hierarchical RAT: top level is a cache of the full table
  • Kucuk et al, PATMOS'03
    • Prevent lookup of sources that will be supplied by a freshly renamed instruction in the same rename group
    • Filter cache
• Could instead organize as an associative lookup in a table organized by physical register with dissipate-on-match comparator (Ergin et al, ICCD'02)

Page 39: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Energy-Efficient Caches

• Filter cache
  • Small L0 cache filters many accesses to L1, allows an L1 with fewer ports (Kin et al, MICRO-30)
• Banks
• Selective cache ways (Albonesi, MICRO-32)
  • Ways in a set-associative cache can be disabled if not needed
  • Many variations of this approach
• Staggering number of papers on this topic
  • Exploit victim cache, load-store queue
  • Clever cache organizations (e.g. combining banks w/ high assoc, specialized caches, etc.)
  • See recent proceedings of VLSI, architecture conferences, esp. ISLPED

Page 40: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Asymmetric Processors

• Processors have different "versions" of the same resource, with different power/latency
• Fast, power-hungry resources are allocated to critical instructions
• Slow, low-power resources are allocated to non-critical instructions
• Criticality predictor is needed!

Page 41: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Asymmetric Processors

• Reducing power of functional units
  • 2 sets of functional units
  • 2 sets of instruction queues
  • Criticality predictor
• Critical instructions
  • In-order queue: critical path is usually a serial chain of dependent instructions
  • Fast functional units
• Non-critical instructions
  • OoO queue
  • Slow functional units

Seng et al, MICRO 2001

Page 42: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Dual Speed Pipelines

• Slow pipeline works at half the frequency
• Criticality predictor is a key component for maintaining energy efficiency
• No communication penalties

[Figure: pipeline diagram. Labels: Fetch, Decode, Slow pipeline, Fast pipeline, RegFile, Commit, Criticality predictor]

Pyreddy and Tyson, WCED 2001

Page 43: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Dynamic Voltage/Frequency Scaling

• Allow the device to dynamically adapt the voltage (and the frequency)
  • $P \propto V_{dd}^2$
  • $f \propto (V_{dd} - V_{th})^k / V_{dd}$
• Tradeoff between power reduction and delay increase
  • MUST BE energy-efficient
• Already implemented in many processors
• Implementation
  • Voltage regulator
  • Predict future processor utilization and adjust frequency/voltage to maximize power reduction while keeping performance
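To make the voltage/frequency relation concrete, the sketch below finds the lowest supply voltage that still delivers a requested fraction of peak frequency. It assumes the alpha-power-law form in which frequency grows as (Vdd - Vth)^k / Vdd, the reciprocal of gate delay. The function names and the example values of `vth` and `k` are illustrative assumptions, not figures from the tutorial.

```python
def frequency(vdd: float, vth: float, k: float) -> float:
    """Alpha-power-law frequency in arbitrary units: f ~ (Vdd - Vth)^k / Vdd."""
    return (vdd - vth) ** k / vdd


def min_vdd_for(target_ratio: float, vdd_max: float, vth: float, k: float,
                tol: float = 1e-6) -> float:
    """Lowest Vdd whose frequency is at least target_ratio * f(vdd_max).

    Uses bisection, which is valid because f(Vdd) is increasing
    for Vdd > Vth when k >= 1.
    """
    target = target_ratio * frequency(vdd_max, vth, k)
    lo, hi = vth, vdd_max
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if frequency(mid, vth, k) >= target:
            hi = mid   # mid is fast enough, try a lower voltage
        else:
            lo = mid   # mid is too slow, raise the voltage
    return hi


# Hypothetical example: voltage needed to run at half of peak frequency.
v_half = min_vdd_for(0.5, vdd_max=1.6, vth=0.3, k=1.5)
```

A utilization predictor would supply `target_ratio`; the voltage regulator is then set to the returned value or to the next supported step above it.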

Page 44: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Transmeta™ LongRun™

• Crusoe processor can configure itself*
  • Voltage changes in steps of 25 mV (depending on the voltage regulator)
  • Frequency changes in steps of 33 MHz
  • From 1.6 V, 600 MHz to 1.2 V, 300 MHz (2001)
• Management
  • Implemented in the Code Morphing™ software layer
  • Idle time of the system is sampled to determine performance demands
• Thermal extension
  • May be a form of thermal throttling
  • Expands the thermal budget of the processor

* Source: http://www.transmeta.com
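As a rough worked example, the two endpoint operating points quoted above can be compared using the usual dynamic-power approximation P ~ Vdd^2 * f. This sketch ignores leakage and switching overheads, and the helper names are illustrative.

```python
def dynamic_power_ratio(v: float, f: float, v_ref: float, f_ref: float) -> float:
    """Dynamic power relative to a reference point, using P ~ V^2 * f."""
    return (v / v_ref) ** 2 * (f / f_ref)


def energy_ratio_fixed_work(v: float, v_ref: float) -> float:
    """Energy for a fixed number of cycles: time scales as 1/f, so E ~ V^2."""
    return (v / v_ref) ** 2


# Endpoints from the slide: 1.6 V at 600 MHz and 1.2 V at 300 MHz.
power = dynamic_power_ratio(1.2, 300, 1.6, 600)   # about 0.28
energy = energy_ratio_fixed_work(1.2, 1.6)        # about 0.56
voltage_steps = round((1.6 - 1.2) / 0.025)        # 16 steps of 25 mV
```

Under these assumptions the low point draws a little over a quarter of the dynamic power but takes twice as long, so the energy for the same work falls to a little over half.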

Page 45: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Transmeta™ LongRun™

• Idle time
  • Voltage drops to minimum
• On-line activity
  • Voltage rises to maximum
• Real-time activity
  • Voltage adjusted to meet requirements
  • DVD player
    • 24 frames/second

Source: Transmeta

Page 46: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Intel SpeedStep®

• Configuration*
  • From 0.844 V (600 MHz) to 1.48 V (1.7 GHz)
  • 100 μs delay
  • Voltage-frequency switching separation

* Source: http://www.intel.com

[Figure: transition sequences. Labels: No Change; Freq. Transition, Volt. Transition; Volt. Transition, Freq. Transition]

Page 47: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Intel SpeedStep®

• Configuration
  • Clock partitioning
    • Core clock
    • Bus clock (sequencer and interrupt interface)
  • Event blocking
    • Interrupts, pin events and snoop requests are not lost

Page 48: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Voltage Scheduling

• Real-time problem will be discussed later
• For non-real-time workloads, goal is to improve energy efficiency
• This is hard, because it is difficult to predict an arbitrary workload's future needs without deadline information
• Instead, try to schedule processes and voltages to reduce idle time
  • e.g., Weiser et al, OSDI-1

Page 49: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Sleep Modes

• ACPI: Advanced Configuration and Power Interface
  • Developed by Microsoft, HP, Toshiba, Phoenix and Intel
• Establishes interfaces for OS-directed power management
• Replaces APM, MPS APIs and PnP BIOS
• Defines
  • Hardware registers
  • BIOS interfaces
  • System and device power states

Source: ACPI overview, http://www.acpi.info

Page 50: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

DVS "Critical Power Slope"

• It may be more efficient not to use DVS, and to run at the highest possible frequency, then go into a sleep mode!
  • Depends on power dissipation in sleep mode
  • And power dissipation at lowest voltage
• This has been formalized as the critical power slope (Miyoshi et al, ICS'02):
  • $m_{critical} = (P_{f_{min}} - P_{idle}) / f_{min}$
  • If the actual slope $m = (P_f - P_{f_{min}}) / (f - f_{min}) < m_{critical}$, then it is more energy efficient to run at the highest frequency, then go to sleep
• Switching overheads must be taken into account
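The slope test can be checked against a direct energy comparison. The sketch below uses hypothetical power numbers and, for simplicity, ignores the switching overheads mentioned above; function names are illustrative.

```python
def critical_slope(p_fmin: float, p_idle: float, f_min: float) -> float:
    """m_critical = (P_fmin - P_idle) / f_min."""
    return (p_fmin - p_idle) / f_min


def actual_slope(p_f: float, p_fmin: float, f: float, f_min: float) -> float:
    """m = (P_f - P_fmin) / (f - f_min)."""
    return (p_f - p_fmin) / (f - f_min)


def prefer_race_to_sleep(p_f, f, p_fmin, f_min, p_idle) -> bool:
    """True when running at f and then sleeping beats running at f_min."""
    return actual_slope(p_f, p_fmin, f, f_min) < critical_slope(p_fmin, p_idle, f_min)


def energy_slow(p_fmin: float, work: float, f_min: float) -> float:
    """Energy to finish `work` cycles at f_min (deadline = work / f_min)."""
    return p_fmin * work / f_min


def energy_race(p_f: float, f: float, p_idle: float, work: float, f_min: float) -> float:
    """Energy to finish at f, then idle until the same deadline."""
    busy = work / f
    deadline = work / f_min
    return p_f * busy + p_idle * (deadline - busy)


# Hypothetical numbers: frequencies in MHz, power in watts, work in Mcycles.
race = prefer_race_to_sleep(p_f=3.0, f=600, p_fmin=2.0, f_min=300, p_idle=0.1)
```

With these numbers the actual slope is below the critical slope, and the direct energy comparison agrees: 1.55 J for race-to-sleep versus 2.0 J at the minimum frequency.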

Page 51: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Multi Clock Domain Architecture

• Multiple clock domains inside the processor
  • Globally-asynchronous locally-synchronous (GALS) clock style
• Independent voltage/frequency scaling
• Synchronizers to ensure inter-domain communication

Page 52: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Multi Clock Domain Architecture

• Advantages
  • Local clock design is not aware of global skew
  • Each domain limited by its local critical path, allowing higher frequencies
  • Different voltage regulators allow for finer-grain energy control
  • Frequency/voltage of each domain can be tailored to its dynamic requirements
  • Clock power is reduced
• Drawbacks
  • Complexity and penalty of synchronizers
  • Feasibility of multiple voltage regulators

Page 53: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Multi Clock Domain Architecture: Synchronization

• Src runs with CLK1, dst with CLK2
• Src writes at T1
• If T > Ts then dst can use the data at T2
• If T < Ts then dst can use the data at T3

[Figure: timing diagram. Labels: CLK1, CLK2, interval T, clock edges 1 to 4]

Semeraro et al, ISCA 2003
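The synchronization rule can be expressed as a small timing calculation. This sketch assumes T is the interval between the source write and the next destination clock edge, and Ts is the synchronizer's required window; `dst_read_time` and its parameters are illustrative names.

```python
import math


def dst_read_time(t_write: float, clk2_period: float, clk2_phase: float,
                  t_sync: float) -> float:
    """Earliest CLK2 edge at which dst may use data written at t_write.

    CLK2 edges occur at clk2_phase + n * clk2_period for integer n.
    """
    # Index of the first CLK2 edge strictly after the write (T2 on the slide).
    n = math.floor((t_write - clk2_phase) / clk2_period) + 1
    next_edge = clk2_phase + n * clk2_period
    if next_edge - t_write > t_sync:
        return next_edge                 # T > Ts: data usable at T2
    return next_edge + clk2_period       # T < Ts: wait one more cycle (T3)
```

The extra destination cycle in the second case is the synchronization penalty that makes the choice of domain boundaries important.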

Page 54: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Multi Clock Domain Architecture

• Domains must be carefully chosen
  • Small cost on communications
  • Re-using existing structures
• Example
  • 5 domains
    • Front-end
    • Integer unit
    • FP unit
    • On-chip cache unit
    • Main memory

Page 55: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Multi Clock Domain Architecture

[Figure: CPU block diagram partitioned into clock domains. Labels: Front-end (L1 i-cache, branch predict, fetch, IFQ, rename, dispatch); Integer (IIQ, int. register file, int. FUs); Floating Point (FIQ, fp. register file, fp. FUs); Memory (LSQ, L1 d-cache, L2 unified cache); Main Memory]

Magklis et al, ISCA 2003

Page 56: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Multi Clock Domain Architecture

• Dynamic voltage/frequency scaling in each domain
• Reconfiguration points must be chosen
  • Off-line "shaker" algorithm
    • Aggressive oracle algorithm with good results
    • Uses detailed dynamic execution trace to find frequencies
    • It is not practical: requires future knowledge of this precise dynamic run
  • On-line attack-decay
    • Interval-based hardware algorithm
    • Transparent to the application, minimal overhead
    • More conservative, achieves 75% efficiency of off-line
  • Profile-based
    • Use profiling to associate frequencies with parts of the code
    • When these points in the code are reached during a dynamic run, change frequencies
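An interval-based attack-decay controller might look like the following. This is one plausible reading, not the published algorithm: it assumes the monitored signal is a per-domain queue utilization, and the threshold, attack and decay constants are hypothetical.

```python
def attack_decay_step(freq: float, util: float, prev_util: float, *,
                      f_min: float, f_max: float,
                      threshold: float = 0.02,
                      attack: float = 0.06,
                      decay: float = 0.002) -> float:
    """Return the domain frequency for the next interval.

    Attack: a large change in utilization moves the frequency sharply
    in the same direction. Decay: otherwise the frequency drifts down
    slowly to save energy.
    """
    delta = util - prev_util
    if delta > threshold:
        freq *= 1 + attack      # demand rose: attack upward
    elif delta < -threshold:
        freq *= 1 - attack      # demand fell: attack downward
    else:
        freq *= 1 - decay       # steady state: slow decay
    return min(f_max, max(f_min, freq))
```

Each clock domain would run its own instance of this step once per interval, which is what keeps the scheme transparent to the application.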

Page 57: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Gating/Throttling

• Gating: disable some of the stages of the processor
  • To reduce useless activity: after a branch misprediction
    • Manne et al, ISCA 1998
  • Effectiveness is heavily dependent on accuracy of branch confidence predictor
    • Parikh et al, HPCA'02
• Throttling: slow down some processor stage when it is predicted that the performance will not be reduced
  • Branch misprediction
  • Long latency load miss
  • IPC reduction in general
  • Baniasadi and Moshovos, ISLPED 2001
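Confidence-based fetch gating can be sketched as a counter of unresolved low-confidence branches. The class below is a simplified illustration: the name `PipelineGate`, the default threshold, and the choice to gate once the count reaches the threshold are assumptions, and the confidence estimator itself is not modeled.

```python
class PipelineGate:
    """Gate fetch while too many low-confidence branches are in flight."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold      # hypothetical tuning parameter
        self.low_conf_in_flight = 0

    def on_branch_fetched(self, low_confidence: bool) -> None:
        if low_confidence:
            self.low_conf_in_flight += 1

    def on_branch_resolved(self, low_confidence: bool) -> None:
        if low_confidence and self.low_conf_in_flight > 0:
            self.low_conf_in_flight -= 1

    def fetch_enabled(self) -> bool:
        # Fetch stalls once the count reaches the threshold, so fewer
        # likely wrong-path instructions enter the pipeline.
        return self.low_conf_in_flight < self.threshold
```

If the confidence predictor marks too many correctly predicted branches as low confidence, fetch stalls needlessly, which is why its accuracy dominates the effectiveness of gating.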

Page 58: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Selective Throttling for Control Speculation

• Control speculation increases power dissipation (28%)
  • Energy wasted by mispredicted instructions
• Selective throttling of fetch/decode
  • Based on branch confidence
• Gating of selection stage
  • Instructions that likely belong to a mispredicted path
• 9% energy-delay improvement

Aragon et al, HPCA 2003

Page 59: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Co-Adaptive Instruction Fetch and Issue

• Fetch gating based on issue queue utilization
  • Rather than using instruction window usage
• Fetch is stopped if close parallelism is present
  • Only instructions from the head of the IQ are issued
  • To match the size of the window residing in the IQ to the application's ILP
• Fetch gating combined with dynamic issue queue adaptation
• 20% energy-delay improvement

Buyuktosunoglu et al, ISCA 2003

Page 60: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Compiler Techniques for Low Power

• Good reference: tutorial by Kremer, PLDI'03
• Traditional compiler optimizations often improve energy efficiency
  • e.g., register allocation, CSE, tiling for cache hit rate
• But some compiler optimizations waste energy
  • e.g., aggressive speculation
• Energy efficiency of code sequences is highly dependent on microarchitecture
  • e.g., free slot in a VLIW word

Page 61: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Compiler Techniques for Low Power, cont.

• Compiler-guided DVS
  • v1: reduce voltage while meeting real-time deadlines
  • v2: reduce voltage in memory-bound program regions
    • Hsu and Kremer, ISLPED'01, PLDI'03
    • Xie et al, PLDI'03
• Dynamic resource configuration/hibernation
  • Deactivate modules when they won't be used for a long time (>> sleep/wakeup time)
    • Heath et al, PACT'02
• Profile/compiler-guided adaptation
  • e.g., profile-guided MCD adaptation mentioned earlier (Magklis et al, ISCA'03)
  • e.g., subroutine-guided ("positional") adaptation (Huang et al, ISCA'03)
    • Uses a hierarchy of low-power modes
• Much work in this area; this only touches the surface

Page 62: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Power Savings for Real Time Systems

• Soft vs. hard real time
• Periodic vs. aperiodic
  • Periodic tasks are especially important in control systems
• Most work has focused on DVS scheduling
• Examples
  • MPEG playback
  • Web server

Page 63: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

DVS for Multimedia Apps (soft real-time approach)

• MM apps must process every frame within a time limit
  • If there is idle time, then there is some slack
  • IPC is constant across frames of the same type
• Slow down the processor to meet deadlines
• 2 phases
  • Profiling
    • Determines the max. number of instructions that can be executed for each configuration
    • Sorts that list
  • Adaptation
    • Predicts the number of instructions to be executed in the next interval
    • Uses the lowest-energy hardware configuration that fulfills requirements

Hughes et al, MICRO 2001
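The two phases can be sketched as a table build followed by a lookup. This is a simplified illustration: the `Config` fields, the example configurations and their numbers are hypothetical, and the instruction-count predictor is left outside the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    max_insts: int   # most instructions executable within one frame deadline
    energy: float    # relative energy per frame


def build_table(configs):
    """Profiling phase: sort configurations from lowest to highest energy."""
    return sorted(configs, key=lambda c: c.energy)


def choose(table, predicted_insts: int) -> Config:
    """Adaptation phase: lowest-energy configuration that meets the demand."""
    for cfg in table:
        if cfg.max_insts >= predicted_insts:
            return cfg
    # Nothing meets the prediction: fall back to the fastest configuration.
    return max(table, key=lambda c: c.max_insts)


# Hypothetical configurations for illustration only.
table = build_table([
    Config("full-speed", 4_000_000, 1.00),
    Config("half-speed", 2_000_000, 0.45),
    Config("narrow-issue", 3_000_000, 0.70),
])
```

Because IPC is assumed constant across frames of the same type, a predicted instruction count translates directly into a time requirement for each configuration.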

Page 64: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

DVS for Multimedia Apps (hard real-time approach)

• Buffering decoded frames provides a control point to enforce deadlines using feedback control
  • Dead-zone proportional-integral controller sets DVS to maintain queue occupancy
  • No profiling or other prior knowledge about stream is needed
  • If queue becomes empty, "panic" mode forces highest speed

[Figure: queue occupancy regions. Labels: dead zone, increase frequency, decrease frequency]

Lu et al, ICCD 2003
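A dead-zone proportional-integral controller of this kind can be sketched as follows. The gains, target occupancy and dead-zone width are hypothetical, and the class is an illustration of the control structure rather than the controller from the cited paper.

```python
class DeadZonePI:
    """Adjust frequency to hold a decoded-frame queue near a target occupancy."""

    def __init__(self, target: float, dead_zone: float, kp: float, ki: float,
                 f_min: float, f_max: float):
        self.target = target
        self.dead_zone = dead_zone
        self.kp = kp
        self.ki = ki
        self.f_min = f_min
        self.f_max = f_max
        self.integral = 0.0

    def update(self, occupancy: float, freq: float) -> float:
        if occupancy == 0:
            return self.f_max                # panic: queue is empty
        error = self.target - occupancy      # positive when the queue runs low
        if abs(error) <= self.dead_zone:
            return freq                      # inside the dead zone: no change
        self.integral += error
        new_freq = freq + self.kp * error + self.ki * self.integral
        return min(self.f_max, max(self.f_min, new_freq))
```

The dead zone avoids changing the voltage/frequency setting on every small fluctuation in occupancy, which matters because each change has a cost.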

Page 65: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

DVS for Web Servers

• Basic idea: load balance, then do DVS to reclaim slack (Elnozahy et al, PACS'02)
  • But it may be more profitable to cluster requests onto fewer nodes and put some to sleep
• Even on single nodes, it may be profitable to briefly defer requests, then batch them at the highest frequency before going to sleep (Elnozahy et al, USITS'03)
• Providing delay guarantees requires feedback control (Sharma et al, RTSS 2001)
  • A natural and effective control point is synthetic utilization
    • Combines true utilization with real-time schedulability

Page 66: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Other Approaches

• Almost all RT algorithms attempt to reclaim slack
• Episode detection (Flautner et al, MOBICOM'01)
  • Identify interactive and periodic events, schedule accordingly
• Program checkpoints: check performance relative to deadline and adjust DVS accordingly
• Exploit direct knowledge of task execution times or utilization
• VISA (Anantaraman et al, ISCA'03)
  • Model a superscalar (unpredictable) processor as a predictable scalar processor to perform RT analysis and scheduling, then reduce DVS setting when superscalar processor runs faster than predicted
  • Use program checkpoints to check progress/slack

Page 67: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Short-Circuit Power

• Main solutions are
  • Reduce rise/fall times
    • Tradeoff: reducing rise/fall times requires stronger drivers, more dynamic power
  • Reduce capacitance being switched

Page 68: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Roadmap

• Introduction & Trends
• Dynamic Power Dissipation
  • Sources, modeling, reduction techniques
• Static Power Dissipation
  • Sources, modeling, reduction techniques
• Summary

Page 69: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Static Power Dissipation

• Static power: dissipation due to leakage current
  • Growing worse because Vth is not scaling as fast as Vdd
• Roadmap
  • Most important sources of static power: subthreshold leakage and gate leakage
  • Inter-process variation
  • Trends
  • Modeling leakage power
  • Circuit/architectural-level techniques

Page 70: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

Static Power

• Main mechanisms for leakage current
  • Subthreshold (Berkeley predictive model):
    • $I_{leakage} = \mu_0 \cdot C_{OX} \cdot \frac{W}{L} \cdot e^{b (V_{dd} - V_{dd0})} \cdot v_t^2 \cdot \left(1 - e^{-V_{dd}/v_t}\right) \cdot \exp\left(\frac{-|V_{th}| - V_{off}}{n \cdot v_t}\right)$
  • Gate
    • $I_{gate} = I_{gate0} \cdot \exp(a \cdot (t_{ox} - t_{ox0})) \cdot \exp(b \cdot (V_{dd} - V_{dd0}))$
• We will focus on subthreshold
  • Gate leakage has essentially been ignored
    • New gate insulation materials may solve the problem, e.g. recent Intel announcement
      • R. Chau, Technology@intel Magazine. www.intel.com
  • Gate-induced drain leakage (GIDL) occurs at negative gate voltages and high Vdd or high values of reverse body bias
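Both leakage expressions are straightforward to evaluate numerically. The sketch below assumes the subthreshold form I = mu0 * Cox * (W/L) * exp(b*(Vdd - Vdd0)) * vt^2 * (1 - exp(-Vdd/vt)) * exp((-|Vth| - Voff)/(n*vt)), with vt the thermal voltage kT/q. All parameter values used with it are placeholders rather than fitted technology data, and the temperature dependence of Vth is ignored.

```python
import math

K_BOLTZMANN = 1.380649e-23      # J/K
Q_ELECTRON = 1.602176634e-19    # C


def thermal_voltage(temp_k: float) -> float:
    """vt = kT/q in volts."""
    return K_BOLTZMANN * temp_k / Q_ELECTRON


def subthreshold_leakage(mu0, cox, w_over_l, vdd, vdd0, vth, voff, n, b, temp_k):
    """Subthreshold leakage current for one device (units follow the inputs)."""
    vt = thermal_voltage(temp_k)
    return (mu0 * cox * w_over_l
            * math.exp(b * (vdd - vdd0))
            * vt ** 2
            * (1.0 - math.exp(-vdd / vt))
            * math.exp((-abs(vth) - voff) / (n * vt)))


def gate_leakage(i_gate0, a, tox, tox0, b, vdd, vdd0):
    """Igate = Igate0 * exp(a*(tox - tox0)) * exp(b*(vdd - vdd0))."""
    return i_gate0 * math.exp(a * (tox - tox0)) * math.exp(b * (vdd - vdd0))
```

Evaluating `subthreshold_leakage` at two temperatures or two threshold voltages shows the exponential sensitivity directly.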

Page 71: © 2004, Kevin Skadron and Jose Gonzalez Power-Aware Design for High-Performance Processors A Tutorial at HPCA-2004 Kevin SkadronJose Gonzalez University.

7171

© 2

004,

Kev

in S

kadr

on a

nd J

ose

Gon

zale

z

Effects of Parameter Variations

• Ioff depends exponentially on Vth
  • There is a large fluctuation of Ioff from die to die and from gate to gate
• Controlling Vth is difficult at nanometer scale
  • Drain-induced barrier lowering
    • Channel length is not constant
    • Exacerbated in sub-100nm devices
  • Discrete dopant effects
    • In a very small channel, small number of dopants
    • Presence of these dopants and random fluctuation of their number lead to changes in Vth from device to device
• Process variation affects
  • Gate length (Ldrawn)
  • Gate oxide thickness (Tox)
  • Channel dose (Nsub)

Srivastava et al, ISLPED 2002
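To see why an exponential dependence on Vth turns modest threshold variation into a large Ioff spread, the sketch below samples Gaussian Vth shifts and maps them to relative leakage. It is only an illustration: the Gaussian model, the subthreshold slope factor n, the temperature, and the sigma value used in the example are all assumptions.

```python
import math
import random
import statistics

BOLTZMANN_OVER_Q = 8.617333e-5  # k/q in volts per kelvin


def ioff_relative(delta_vth, n=1.5, temp_k=358.0):
    """Leakage relative to nominal for a threshold shift of delta_vth volts."""
    vt = BOLTZMANN_OVER_Q * temp_k
    return math.exp(-delta_vth / (n * vt))


def sample_ioff_spread(sigma_vth, samples=10000, seed=0):
    """Return (min, median, max) relative Ioff over Gaussian Vth shifts."""
    rng = random.Random(seed)
    values = [ioff_relative(rng.gauss(0.0, sigma_vth)) for _ in range(samples)]
    return min(values), statistics.median(values), max(values)


if __name__ == "__main__":
    lo, mid, hi = sample_ioff_spread(sigma_vth=0.03)  # 30 mV sigma (hypothetical)
    print(f"relative Ioff: min={lo:.3f} median={mid:.3f} max={hi:.3f}")
```

The median stays near the nominal value while the tails spread over orders of magnitude, so the distribution is strongly skewed toward leaky devices.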


Static Power

• Motivation
  • Growing relative to dynamic power dissipation: soon 50% of total power
  • Exponentially dependent on Temp, Vth, Vdd
  • Natural target for optimization: idle transistors

[Figure: static power / dynamic power (percentage) versus temperature (K) for the 180nm, 130nm, 100nm, 90nm, 80nm, and 70nm nodes; the ratio increases across generations]

Source: Skadron et al, University of Virginia


Static Power

• Modeling Leakage
  • Butts and Sohi (MICRO-33)
    • Pstatic = Vcc · N · kdesign · Îleak
    • Îleak determined by circuit simulation, kdesign empirically
    • Key contribution: separate technology from design
  • HotLeakage (UVA TR CS-2003-05, DATE’04)
    • Extension of Butts & Sohi approach: scalable with Vdd, Vth, Temp, and technology node; adds gate leakage
    • Îleak determined by BSIM3 subthreshold equation and BSIM4 gate-leakage equations, giving an analytical expression that accounts for dependence on factors that may change at runtime, namely Vdd, Vth, and Temp
    • kdesign replaced by separate factors for N- and P-type transistors
    • kdesign also exponentially dependent on Vdd and Tox, linearly dependent on Temp
    • Currently integrated with SimpleScalar/Wattch for caches
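The Butts and Sohi expression is simple enough to write out directly. The sketch below is a minimal illustration. The second function is one plausible reading of "separate factors for N- and P-type transistors" and is an assumption, not the HotLeakage source code. All names are hypothetical.

```python
def static_power(vcc, n_transistors, k_design, i_leak):
    """P_static = Vcc * N * k_design * I_leak, in watts.

    i_leak is the per-device leakage current (amperes) from circuit
    simulation; k_design is the empirical design-dependent factor.
    """
    return vcc * n_transistors * k_design * i_leak


def static_power_np(vdd, n_nmos, k_n, i_leak_n, n_pmos, k_p, i_leak_p):
    """Assumed variant with separate design factors for NMOS and PMOS."""
    return vdd * (n_nmos * k_n * i_leak_n + n_pmos * k_p * i_leak_p)
```

Keeping k_design apart from the per-device current is what lets the same structure count N be reused across technology nodes: only the leakage term changes.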


Static Power

• Modeling Leakage (cont.)
  • Su et al, IBM (ISLPED’03)
    • Similar approach to HotLeakage, but they observe that modeling the change in leakage allows linearization of the equations
  • Many, many other papers on various aspects of leakage modeling
    • Most focus on subthreshold
    • Few suggest how to model leakage in microarchitecture simulations
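The linearization idea can be illustrated with a first-order expansion around a reference temperature. This is a generic sketch, not Su et al's actual formulation: it assumes leakage varies as exp(beta * (T - T_ref)) near the reference point, and beta is a hypothetical fitting coefficient.

```python
import math


def leakage_exact(i_ref, beta, temp, temp_ref):
    """Assumed exponential temperature dependence around temp_ref."""
    return i_ref * math.exp(beta * (temp - temp_ref))


def leakage_linearized(i_ref, beta, temp, temp_ref):
    """First-order approximation: only the change from i_ref is modeled."""
    return i_ref * (1.0 + beta * (temp - temp_ref))
```

The linear form tracks the exponential closely for small excursions from the reference temperature and drifts apart as the excursion grows, so the reference point has to stay near the operating point.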


Circuit/architectural level techniques

• Transistor sizing
• Dual Vth
• DVS
• Dynamic threshold voltage: reverse body bias
• Sleep transistors
• Low leakage caches/branch predictors
• Low leakage register file
• Low leakage issue queue
• Low leakage ALUs
• Techniques for reducing gate leakage
• What else?


Transistor sizing, Dual-Vth

• Transistor sizing
  • Reducing W/L reduces leakage: use smallest possible transistors
  • Leakage-performance tradeoff
• Dual-Vth
  • High-threshold transistors dramatically reduce leakage: use low-Vth on critical paths, high-Vth elsewhere
  • Often suggested in caches: many possible permutations
• DVS
  • Leakage is exponentially dependent on Vdd, so DVS reduces leakage
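The dual-Vth rule ("low-Vth on critical paths, high-Vth elsewhere") can be sketched as a slack test. This is a deliberately simplified illustration: it treats each gate's timing slack independently, which a real timing-driven flow would not do, and the gate names, slack values, and delay penalty are hypothetical.

```python
def assign_vth(gate_slack_ps, high_vth_penalty_ps):
    """Pick a threshold flavor per gate from its timing slack (picoseconds).

    A gate gets the slower, low-leakage high-Vth device only if its slack
    can absorb the assumed delay penalty; otherwise it stays low-Vth.
    """
    return {
        gate: "high" if slack >= high_vth_penalty_ps else "low"
        for gate, slack in gate_slack_ps.items()
    }


if __name__ == "__main__":
    slacks = {"alu_carry": 0.0, "decoder": 35.0, "sram_cell": 120.0}
    print(assign_vth(slacks, high_vth_penalty_ps=20.0))
```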


Dynamic Threshold Voltage

• Adjust threshold voltage dynamically
  • Also called reverse body bias (RBB), auto backgate-controlled multi-threshold CMOS (ABB-MTCMOS) (Nii et al, ISLPED’98)
  • Apply negative voltage to body: requires larger VGS to establish channel, so it raises Vth
  • Engage RBB for idle transistors
  • Preserves state
  • Requires twin-well process; more expensive to manufacture
  • Limited by GIDL
  • Can also be used at testing to adjust circuit properties and reduce parameter variations
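How much a given body bias raises Vth can be estimated with the textbook body-effect relation, Vth = Vth0 + gamma * (sqrt(2*phi_F + V_SB) - sqrt(2*phi_F)). The sketch below assumes that long-channel form for an NMOS device. The gamma and phi_F values in the example are hypothetical, and short-channel devices deviate from it.

```python
import math


def vth_with_body_bias(vth0, gamma, phi_f, v_sb):
    """Threshold voltage under source-to-body bias v_sb (volts).

    v_sb > 0 corresponds to reverse body bias on an NMOS device.
    gamma is the body-effect coefficient (V^0.5) and phi_f the Fermi
    potential magnitude (V).
    """
    return vth0 + gamma * (math.sqrt(2.0 * phi_f + v_sb) - math.sqrt(2.0 * phi_f))


if __name__ == "__main__":
    for v_sb in (0.0, 0.25, 0.5):
        print(v_sb, round(vth_with_body_bias(0.3, 0.4, 0.3, v_sb), 4))
```

The square-root dependence means each additional increment of reverse bias buys less Vth, which, together with GIDL, bounds how far RBB can usefully be pushed.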


Sleep Transistors

• Add a high-Vth transistor between the circuit and either/both power rails: the sleep transistor
  • Also referred to as a “header” (to Vdd) or “footer” (to ground)
• The high-Vth transistor cuts off most leakage
  • In fact, a properly sized, lower-Vth footer transistor can preserve enough leakage to keep the cell active (Li et al, PACT’02; Agarwal et al, DAC’02)
• Great care must be taken when switching back to full voltage: noise can flip bits
• Extra latency may be necessary when re-activating


Low-Leakage Caches

• Gated-Vdd/Vss (Powell et al, ISLPED’00; Kaxiras et al, ISCA-28)
  • Uses sleep transistor on Vdd/ground for each cache line
  • Typically considered non-state-preserving, but recent work (Agarwal et al, DAC’02) suggests that gated-Vss may preserve state
  • Many algorithms for determining when to gate
  • Simplest (Kaxiras et al, ISCA-28): two-bit access counter and decay interval
  • Adaptive decay intervals: hard
• Drowsy cache (Flautner et al, ISCA-29)
  • Uses dual supply voltages: normal Vdd and a low Vdd close to the threshold voltage
  • State preserving, but requires an extra cycle to wake up; two extra cycles if tags are decayed
• State preservation using leakage currents (Li et al, PACT’02; Agarwal et al, DAC’02)
  • Similar to gated-Vss but designed to keep supply voltage high enough to preserve state (100-120 mV)
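The "two-bit access counter and decay interval" policy can be sketched as a small simulation. The details here are assumptions made for illustration: a global counter ticks four times per decay interval, each tick increments a per-line two-bit counter, an access resets it, and a line whose counter is already saturated at a tick is gated off. Class and method names are hypothetical.

```python
class DecayCache:
    """Toy model of per-line cache decay with two-bit counters."""

    COUNTER_MAX = 3  # largest value a two-bit counter can hold

    def __init__(self, num_lines, decay_interval):
        # Assumption: the global counter ticks four times per decay interval.
        self.tick_period = decay_interval // 4
        self.cycle = 0
        self.counters = [0] * num_lines
        self.powered = [False] * num_lines  # lines start gated off

    def access(self, index):
        """Touch a line; return True if it had to be powered back on."""
        was_gated = not self.powered[index]
        self.powered[index] = True
        self.counters[index] = 0
        return was_gated

    def advance(self, cycles):
        """Advance time, gating lines whose counter is already saturated."""
        for _ in range(cycles):
            self.cycle += 1
            if self.cycle % self.tick_period != 0:
                continue
            for i, on in enumerate(self.powered):
                if not on:
                    continue
                if self.counters[i] == self.COUNTER_MAX:
                    self.powered[i] = False  # idle for about one decay interval
                else:
                    self.counters[i] += 1
```

Because only the coarse global tick is distributed, each line needs just two bits of state, at the cost of a gating time that is accurate only to within one tick period.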


Low-Leakage Caches, cont.

• Comparison (Parikh, Li, et al, WDDD’03, DATE’04)
  • Compared non-state-preserving gated-Vss with state-preserving drowsy cache
  • If gating is state-preserving, it wins because it essentially eliminates subthreshold and gate leakage
    • Unless wakeup time is significantly longer than with drowsy
  • Otherwise, drowsy cache typically has an advantage because it is state preserving; no L2 accesses needed on “induced misses”
  • But induced misses are rare, so for a reasonable range of on-chip L2 penalties (< 8 cycles in our studies), gating can still be superior
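The tradeoff in this comparison can be written as a back-of-the-envelope energy balance: leakage saved while lines are in low-power mode, minus the energy of the events that mode causes (induced misses for gating, wakeups for drowsy). The model below is a hypothetical simplification, not the one used in the cited study, and every number in the example is invented for illustration.

```python
def net_leakage_savings(idle_cycles, leak_per_cycle, residual_fraction,
                        events, energy_per_event):
    """Net energy saved by a low-leakage mode, in arbitrary energy units.

    residual_fraction is the share of leakage that remains in low-power
    mode; events are induced misses (gating) or wakeups (drowsy).
    """
    saved = idle_cycles * leak_per_cycle * (1.0 - residual_fraction)
    overhead = events * energy_per_event
    return saved - overhead


if __name__ == "__main__":
    # Hypothetical numbers: gating removes nearly all leakage but each
    # induced miss costs an L2 access; drowsy keeps some leakage but
    # wakeups are cheap.
    gated = net_leakage_savings(1_000_000, 1.0, 0.02, 100, 500.0)
    drowsy = net_leakage_savings(1_000_000, 1.0, 0.15, 1000, 1.0)
    print("gated:", gated, "drowsy:", drowsy)
```

In this model the ranking flips once induced misses become frequent or expensive enough, which is the sensitivity to L2 penalty that the slide describes.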


Low-Leakage Caches, cont: 4T Cells

• 4T-based branch predictors, caches
  • Hu, Juang, et al, ISLPED’02, CA-Letters’02
  • Non-state-preserving
  • Decay rate: temperature-dependent
    • Can be adjusted with passives
  • Eliminates decay state bits
• 4 transistor cells [4T]
  • Eliminates two transistors connected to Vdd
  • Naturally decays over time
  • Refreshes upon access
  • When decayed, force default output
  • Up to 33% smaller than equivalent 6T
  • Decays quickly [8K cycles at 1 GHz]
  • Leak only as much energy as is deposited

[Figure: 6T (left) and 4T (right) circuit diagrams]


Low-Leakage Caches, cont: Other Techniques

• RBB (Nii et al, ISLPED’98)
  • Back bias cache lines that are idle; can use the same decay counters as gated-Vdd/Vss
• Leakage-biased bitlines (Heo et al, ISCA-29)
  • Disable precharge and let the bitlines float: they will settle to a value that minimizes leakage
  • Can only be applied to idle subbanks and requires accurate prediction of which subbank will be accessed
• Huge variety of other techniques; this is only an overview of some of the major ones


Register Files

• In general, state-preserving techniques for caches may work for register files too
• Leakage-biased bitlines work here too
  • Register file divided into subbanks
• Alvandpour et al, Intel, ISLPED’01
  • Uses dual Vth and a conditional keeper
    • “Keeper” used on dynamic circuits to counteract voltage droop due to leakage; they constitute a static pull-up path
    • Dynamic circuits arise in the muxes due to multiporting
    • “Conditional” keeper technique uses two cascaded keepers; one is fixed and the other only engaged when needed to drive an output; requires careful timing analysis
  • Access transistors and keepers are high-Vt/


ALUs

• Usually Dual-VT domino logic
  • Area & Speed
• Sleep transistors can be used, but at a cost
  • Dynamic nodes are discharged
  • Can be used if worthwhile

Dropsho et al, MICRO 2002


Other Techniques

• Queues (e.g., issue queues)
  • Various occupancy-based or rate-matching techniques have been proposed for issue queue resizing
  • Deactivating queue entries reduces leakage
  • e.g., Ponomarev et al, MICRO-34
• Compiler techniques
  • When compiler knows that regions are idle, they can be deactivated
  • e.g., Zhang et al, MICRO-35
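An occupancy-based resizing policy can be sketched as a periodic decision over sampled queue occupancy. The policy below is a hypothetical illustration and does not reproduce any of the cited proposals: the chunk size, the thresholds, and the function name are all assumptions.

```python
def next_active_size(active, occupancy_samples, full_events,
                     total=64, chunk=8, full_threshold=4):
    """Choose the number of powered issue-queue entries for the next period.

    Grow by one chunk if the queue filled up too often during the period.
    Shrink by one chunk if average occupancy left at least a chunk unused.
    Deactivated entries are the ones that stop leaking.
    """
    if full_events > full_threshold and active < total:
        return min(total, active + chunk)
    average = sum(occupancy_samples) / len(occupancy_samples)
    if active - average >= chunk and active > chunk:
        return active - chunk
    return active


if __name__ == "__main__":
    print(next_active_size(64, [20, 22, 18, 24], full_events=0))
    print(next_active_size(32, [30, 31, 32, 32], full_events=10))
```

Checking the grow condition first biases the policy toward performance: a queue that is stalling dispatch is enlarged even if its average occupancy looks low.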


Gate Leakage

• Any technique that reduces Vdd also reduces gate leakage
• Otherwise it seems difficult to develop architecture techniques that directly attack gate leakage
• In fact, very little work has been done in this area
• One example: domino gates (Hamzaoglu & Stan, ISLPED’02)
  • Replace traditional NMOS pull-down network with a PMOS pull-up network
  • Gate leakage is greater in NMOS than PMOS
  • But PMOS domino gate is slower


Roadmap

• Introduction & Trends
• Dynamic Power Dissipation
  • Sources, modeling, reduction techniques
• Static Power Dissipation
  • Sources, modeling, reduction techniques
• Summary


Other Power-Related Issues

• Thermal
  • Managing on-chip temperatures (as opposed to average heat dissipation) is not just a matter of reducing average power density
  • Spatial and temporal variation
    • Spatial: hot spots; must reduce power density in the right places
    • Temporal: must reduce power when chip is hot
  • This is often when there is less slack
  • Must model temperature directly
    • Average power metrics do not accurately predict temperature (Skadron et al, ISCA’03)
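A single-node thermal RC model is enough to show why average power does not predict temperature. The sketch below is a minimal illustration under strong assumptions: one lumped thermal resistance and capacitance, forward-Euler integration, and hypothetical parameter values. Real thermal models use a network of such nodes.

```python
def simulate_temperature(power_trace, r_th, c_th, t_ambient, dt):
    """Integrate C * dT/dt = P - (T - T_ambient) / R with forward Euler.

    power_trace: watts per time step; r_th in K/W; c_th in J/K; dt in s.
    Returns the temperature (K) after each step.
    """
    temps = []
    temp = t_ambient
    for power in power_trace:
        temp += dt * (power - (temp - t_ambient) / r_th) / c_th
        temps.append(temp)
    return temps


if __name__ == "__main__":
    steady = [10.0] * 2000                       # constant 10 W
    bursty = ([20.0] * 500 + [0.0] * 500) * 2    # same average, in bursts
    for name, trace in (("steady", steady), ("bursty", bursty)):
        temps = simulate_temperature(trace, r_th=1.0, c_th=0.01,
                                     t_ambient=318.0, dt=1e-4)
        print(name, "peak temperature (K):", round(max(temps), 2))
```

Both traces dissipate the same average power, yet the bursty one reaches a markedly higher peak because each burst lasts several thermal time constants.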


Other Power-Related Issues

• Voltage stability (dI/dt)
  • Inductance means that abrupt changes in current can cause voltage droop
  • This can be addressed with decoupling capacitance, but required capacitance is becoming expensive
  • Grochowski et al, HPCA’02; Joseph et al, HPCA’03
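The size of the droop follows from V = L * dI/dt. The sketch below is a first-order estimate only: it assumes a single lumped supply inductance and a linear current ramp, and it ignores decoupling capacitance and resonance. The values in the example are hypothetical.

```python
def inductive_droop(inductance_h, delta_current_a, delta_time_s):
    """First-order supply droop in volts for a linear current ramp."""
    return inductance_h * delta_current_a / delta_time_s


if __name__ == "__main__":
    # Hypothetical: 10 pH effective inductance, 20 A swing in 1 ns.
    print(inductive_droop(10e-12, 20.0, 1e-9), "V")
```

The same current swing spread over a longer ramp produces a proportionally smaller droop.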


Roadmap

• Introduction & Trends
• Dynamic Power Dissipation
  • Sources, modeling, reduction techniques
• Static Power Dissipation
  • Sources, modeling, reduction techniques
• Summary


Summary

• Power dissipation is becoming a huge concern
  • Total power budget
  • Power density (thermal)
  • Energy consumption & battery life
• Power dissipation
  • Switching
  • Short-circuit
  • Leakage
• Power modeling crucial
  • Academia: accurate research
  • Industry: detect hot spots on time to meet POR


Summary

• Reducing dynamic power
  • Circuits perspective
    • Energy-effective access (reducing capacitance or driving voltage)
    • Gating
  • Architectural perspective
    • Decreasing activity factor
    • Pipeline gating
    • Adjusting voltage/frequency to meet application requirements
• Reducing static power
  • Dual Vth
  • Non-state-preserving vs. state-preserving techniques