
Copyright Agrawal, 2007. ELEC6270 Fall 07, Lecture 14

ELEC 5270/6270 Fall 2007
Low-Power Design of Electronic Circuits

Power Aware Microprocessors

Vishwani D. Agrawal
James J. Danaher Professor
Dept. of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849

[email protected]
http://www.eng.auburn.edu/~vagrawal/COURSE/E6270_Fall07/course.html


SIA Roadmap for Processors (1999)

Year                      1999    2002    2005    2008    2011    2014
Feature size (nm)         180     130     100     70      50      35
Logic transistors/cm^2    6.2M    18M     39M     84M     180M    390M
Clock (GHz)               1.25    2.1     3.5     6.0     10.0    16.9
Chip size (mm^2)          340     430     520     620     750     900
Power supply (V)          1.8     1.5     1.2     0.9     0.6     0.5
High-perf. power (W)      90      130     160     170     175     183

Source: http://www.semichips.org


Power Reduction in Processors

Just about everything is used.

Hardware methods:
  Voltage reduction for dynamic power
  Dual-threshold devices for leakage reduction
  Clock gating, frequency reduction
  Sleep mode

Architecture:
  Instruction set
  Hardware organization

Software methods


SPEC CPU2000 Benchmarks

Twelve integer and 14 floating point programs, CINT2000 and CFP2000.

Each program's run time is normalized to obtain a SPEC ratio with respect to the run time on a Sun Ultra 5_10 with a 300MHz processor.

CINT2000 and CFP2000 summary measurements are the geometric means of SPEC ratios.

LINPACK is a numerically intensive floating point linear system (Ax = b) program used for benchmarking supercomputers.
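The normalization and geometric-mean summary described above can be sketched as follows. This is a minimal illustration, not SPEC's tooling: the program names and run times are hypothetical, and the scale factor of 100 (so that the reference machine scores 100) is an assumption, since the slide does not state it.

```python
import math


def spec_ratio(reference_seconds, measured_seconds, scale=100.0):
    """Run time normalized to the reference machine (larger means faster).

    The scale factor is an assumption; the slide only says "normalized".
    """
    return scale * reference_seconds / measured_seconds


def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical run times in seconds: (reference machine, machine under test).
run_times = {
    "prog_a": (1400.0, 140.0),
    "prog_b": (1800.0, 450.0),
    "prog_c": (1100.0, 275.0),
}

ratios = {name: spec_ratio(ref, t) for name, (ref, t) in run_times.items()}
summary = geometric_mean(ratios.values())
print(ratios, round(summary, 1))
```

The geometric mean is used so that the summary does not depend on which machine is chosen as the reference: changing the reference rescales every ratio by the same factor.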


Reference CPU s: Sun Ultra 5_10 300MHz Processor

[Figure: bar chart of reference run times in seconds (axis 0 to 3500) for the CINT2000 programs (gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf) and the CFP2000 programs (wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi).]


CINT2000: 3.4GHz Pentium 4, HT Technology (D850MD Motherboard)

[Figure: bar chart of base ratio and opt. ratio (axis 0 to 2500) for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf.]

SPECint2000_base = 1341
SPECint2000 = 1389

Source: www.spec.org


Two Benchmark Results

Baseline: A uniform configuration not optimized for a specific program:
  Same compiler with same settings and flags used for all benchmarks
  Other restrictions

Peak: Run is optimized for obtaining the peak performance for each benchmark program.


CFP2000: 3.6GHz Pentium 4, HT Technology (D925XCV/AA-400 Motherboard)

[Figure: bar chart of base ratio and opt. ratio (axis 0 to 3000) for wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi.]

SPECfp2000_base = 1627
SPECfp2000 = 1630



CINT2000: 1.7GHz Pentium 4 (D850MD Motherboard)

[Figure: bar chart of base ratio and opt. ratio (axis 0 to 1000) for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf.]

SPECint2000_base = 579
SPECint2000 = 588



CFP2000: 1.7GHz Pentium 4 (D850MD Motherboard)

[Figure: bar chart of base ratio and opt. ratio (axis 0 to 1400) for wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi.]

SPECfp2000_base = 648
SPECfp2000 = 659



Energy SPEC Benchmarks

Energy efficiency mode: Besides the execution time, energy efficiency of SPEC benchmark programs is also measured. Energy efficiency of a benchmark program is given by:

$$\text{Energy efficiency} = \frac{1 / (\text{Execution time})}{\text{joules consumed}}$$


Energy Efficiency

Efficiency averaged on n benchmark programs:

$$\text{Efficiency} = \left( \prod_{i=1}^{n} \text{Efficiency}_i \right)^{1/n}$$

where $\text{Efficiency}_i$ is the efficiency for program i.

Relative efficiency:

$$\text{Relative efficiency} = \frac{\text{Efficiency of a computer}}{\text{Eff. of reference computer}}$$
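The per-program efficiency, its geometric-mean average, and the relative efficiency can be sketched in a few lines. The measurements below are hypothetical and serve only to exercise the formulas; the function names are illustrative.

```python
import math


def energy_efficiency(execution_seconds, joules):
    """(1 / execution time) / joules consumed, as defined for one program."""
    return (1.0 / execution_seconds) / joules


def mean_efficiency(efficiencies):
    """Geometric mean over n benchmark programs."""
    effs = list(efficiencies)
    return math.prod(effs) ** (1.0 / len(effs))


def relative_efficiency(machine_effs, reference_effs):
    """Efficiency of a computer divided by that of the reference computer."""
    return mean_efficiency(machine_effs) / mean_efficiency(reference_effs)


# Hypothetical (seconds, joules) measurements for two programs.
reference = [energy_efficiency(100.0, 2000.0), energy_efficiency(200.0, 4000.0)]
candidate = [energy_efficiency(100.0, 1000.0), energy_efficiency(200.0, 1000.0)]

rel = relative_efficiency(candidate, reference)
print(round(rel, 3))
```

In this made-up case the candidate is 2x and 4x more efficient on the two programs, so the relative efficiency is the geometric mean of those two factors.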


SPEC2000 Relative Energy Efficiency

[Figure: bar chart of relative energy efficiency (axis 0 to 6) for SPECINT2000 and SPECFP2000 under three operating modes: always max. clock, laptop adaptive clk., and min. power min. clock. Processors compared: Pentium [email protected]/0.6GHz energy-efficient processor, Pentium [email protected] (reference), and Pentium [email protected].]


Voltage Scaling

Dynamic: Reduce voltage and frequency during idle or low activity periods.

Static: Clustered voltage scaling
  Logic on non-critical paths given lower voltage.
  47% power reduction with 10% area increase reported.
  M. Igarashi et al., “Clustered Voltage Scaling Techniques for Low-Power Design,” Proc. IEEE Symp. Low Power Design, 1997.


Processor Utilization

Throughput = Operations / second

[Figure: throughput versus time, showing compute-intensive processes at maximum throughput, low throughput (background) processes, and system idle periods.]


Examples of Processes

Compute-intensive: spreadsheet, spelling check, video decoding, scientific computing.

Low throughput: data entry, screen updates, low bandwidth I/O data transfer.

Idle: no computation, no expected output.


Effects of Voltage Reduction

Voltage reduction increases delay, decreases throughput:
  Slow reduction in throughput at first
  Rapid reduction in throughput for $V_{DD} \le V_{th}$
  Time per operation (TPO) increases

Voltage reduction continues to reduce power consumption:
  Energy per operation (EPO) = Power × TPO
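The relation EPO = Power × TPO can be explored numerically with a simple model. The sketch below makes two assumptions that are not on the slide: gate delay follows an alpha-power law, delay ∝ VDD / (VDD − Vth)^α with a hypothetical α = 2, and power is purely dynamic (C·VDD²·f with f = 1/TPO, no leakage). All quantities are in arbitrary units with Vth = 1.

```python
def time_per_operation(vdd, vth=1.0, alpha=2.0):
    """Assumed delay model: TPO ~ VDD / (VDD - Vth)**alpha, valid for VDD > Vth."""
    if vdd <= vth:
        raise ValueError("this delay model is only valid for VDD > Vth")
    return vdd / (vdd - vth) ** alpha


def dynamic_power(vdd, vth=1.0, alpha=2.0, cap=1.0):
    """Dynamic power C * VDD**2 * f, with the clock set to f = 1 / TPO."""
    return cap * vdd ** 2 / time_per_operation(vdd, vth, alpha)


def energy_per_operation(vdd, vth=1.0, alpha=2.0, cap=1.0):
    """EPO = Power x TPO."""
    return dynamic_power(vdd, vth, alpha, cap) * time_per_operation(vdd, vth, alpha)


ratios = [1.5, 2.0, 3.0, 4.0, 5.0]  # VDD / Vth
rows = [(r, time_per_operation(r), dynamic_power(r), energy_per_operation(r))
        for r in ratios]
for r, tpo, power, epo in rows:
    print(f"VDD/Vth={r:.1f}  TPO={tpo:.3f}  Power={power:.3f}  EPO={epo:.3f}")
```

Under these assumptions TPO grows sharply as VDD approaches Vth, while EPO keeps falling with voltage because it reduces to C·VDD². A model that included leakage would behave differently at low voltage.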


Energy per Operation (EPO)

[Figure: Power, TPO, and EPO (normalized axis, 0.0 to 1.0) plotted against $V_{DD} / V_{th}$ from 1 to 5.]


Dynamic Voltage and Clock

                         Time spent in:
Throughput               Fast mode    Slow mode    Idle mode    Battery life
Always full speed        10%          0%           90%          1 hr
Sometimes full speed     1%           90%          9%           5.3 hrs
Rarely full speed        0.1%         99%          0.9%         9.2 hrs

T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002, pp. 35-36.


Problem of Process Variation and Leakage

[Figure: distribution of number of chips versus Vth, from lower Vth to higher Vth. Chips on the low-Vth side exceed the power specification (yield loss due to high leakage); chips on the high-Vth side miss the clock specification (yield loss due to slow speed).]

From a presentation: Power Reduction using LongRun2 in Transmeta’s Efficeon Processor, by D. Ditzel, May 17, 2006


Pipeline Gating

A pipeline processor uses speculative execution.
  Incorrect branch prediction results in pipeline stalls and wasted energy.

Idea: Stop fetching instructions if a branch hazard is expected:
  If the count (M) of incorrect predictions exceeds a pre-specified number (N), then suspend fetching instructions for some k cycles.

Ref.: S. Manne, A. Klauser and D. Grunwald, “Pipeline Gating: Speculation Control for Energy Reduction,” Proc. 25th Annual International Symp. Computer Architecture, June 1998.
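The gating rule (count M, threshold N, suspend for k cycles) can be illustrated with a toy fetch timeline. This is a simplified sketch, not the mechanism of the cited paper: the per-instruction flags stand in for whatever signal marks a likely misprediction, and resetting M to zero after each gating period is an assumption made here for simplicity.

```python
def simulate_fetch_gating(mispredict_flags, n_threshold, k_cycles):
    """Return a cycle-by-cycle timeline of "fetch" and "gated" entries.

    mispredict_flags: one boolean per fetched instruction; True marks a
    branch counted toward M (hypothetical input for illustration).
    """
    timeline = []
    m_count = 0
    for flagged in mispredict_flags:
        timeline.append("fetch")
        if flagged:
            m_count += 1
        if m_count > n_threshold:
            # Suspend instruction fetch for k cycles, then start counting again.
            timeline.extend(["gated"] * k_cycles)
            m_count = 0  # assumption: the counter is cleared after gating
    return timeline


flags = [False, True, True, False, True]
timeline = simulate_fetch_gating(flags, n_threshold=1, k_cycles=2)
print(timeline)
```

With N = 1 and k = 2, the second flagged branch pushes M above N, so two gated cycles appear right after the third fetch.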


Slack Scheduling

Application: Superscalar, out-of-order execution:
  An instruction is executed as soon as the required data and resources become available.
  A commit unit reorders the results.

Delay the completion of instructions whose result is not immediately needed.

Example of RISC instructions:
  add r0, r1, r2;    (A)
  sub r3, r4, r5;    (B)
  and r9, x1, r9;    (C)
  or r5, r9, r10;    (D)
  xor r2, r10, r11;  (E)

J. Casmira and D. Grunwald, “Dynamic Instruction Scheduling Slack,” Proc. ACM Kool Chips Workshop, Dec. 2000.
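One way to see which of the five instructions has slack is to compute earliest (ASAP) and latest (ALAP) issue levels from their data dependences. The sketch below rests on assumptions not stated on the slide: the last operand of each instruction is taken as the destination, every instruction has unit latency, and only read-after-write dependences are considered (as if registers were renamed).

```python
# Each entry: (label, source registers, destination register).
# Assumption: the LAST operand listed on the slide is the destination.
program = [
    ("A", ("r0", "r1"), "r2"),
    ("B", ("r3", "r4"), "r5"),
    ("C", ("r9", "x1"), "r9"),
    ("D", ("r5", "r9"), "r10"),
    ("E", ("r2", "r10"), "r11"),
]


def slack_table(program):
    """Map each label to (asap, alap, slack) using unit-latency levels."""
    last_writer = {}
    preds = {}
    for label, sources, dest in program:
        # Read-after-write: depend on the most recent writer of each source.
        preds[label] = {last_writer[s] for s in sources if s in last_writer}
        last_writer[dest] = label

    asap = {}
    for label, _, _ in program:
        asap[label] = max((asap[p] + 1 for p in preds[label]), default=0)

    depth = max(asap.values())
    alap = {}
    for label, _, _ in reversed(program):
        succs = [s for s in preds if label in preds[s]]
        alap[label] = min((alap[s] - 1 for s in succs), default=depth)

    return {label: (asap[label], alap[label], alap[label] - asap[label])
            for label in asap}


table = slack_table(program)
for label, (early, late, slack) in table.items():
    print(label, early, late, slack)
```

Under these assumptions D needs the results of B and C, and E needs the results of A and D, so A is the only instruction whose completion can be delayed by one level without lengthening the schedule.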


Slack Scheduling Example

[Figure: two schedule diagrams for instructions A, B, C, D, and E, one under standard scheduling and one under slack scheduling.]


Slack Scheduling

[Figure: block diagram with scheduling logic, a re-order buffer with a slack bit, and low-power execution units.]


Clock Distribution

[Figure: clock distribution network driven from the clock input.]


Clock Power

$$P_{clk} = C_L V_{DD}^2 f + \frac{C_L V_{DD}^2 f}{\lambda} + \frac{C_L V_{DD}^2 f}{\lambda^2} + \cdots = C_L V_{DD}^2 f \sum_{n=0}^{\text{stages}-1} \frac{1}{\lambda^n}$$

where $C_L$ = total load capacitance
      $\lambda$ = constant fanout at each stage in distribution network

Clock consumes about 40% of total processor power.
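The series for the clock power is easy to evaluate directly. The following sketch uses hypothetical values for load, voltage, frequency, fanout, and number of stages; only the formula itself comes from the slide.

```python
def clock_power(c_load, vdd, freq, fanout, stages):
    """P_clk = C_L * VDD^2 * f * sum_{n=0}^{stages-1} 1 / fanout**n."""
    series = sum(1.0 / fanout ** n for n in range(stages))
    return c_load * vdd ** 2 * freq * series


# Hypothetical example: 1 nF load, 2 V supply, 100 MHz, fanout 2, 3 stages.
p_clk = clock_power(c_load=1e-9, vdd=2.0, freq=100e6, fanout=2, stages=3)
print(f"{p_clk:.3f} W")

# For many stages the series approaches fanout / (fanout - 1), which bounds
# the overhead contributed by the buffer stages.
p_deep = clock_power(c_load=1e-9, vdd=2.0, freq=100e6, fanout=2, stages=40)
```

The n = 0 term is the power of driving the final load; each stage further up the tree adds a contribution smaller by a factor of λ, so a larger fanout means less buffer overhead.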


Clock Network Examples

                     Alpha 21064      Alpha 21164      Alpha 21264
Technology           0.75μ CMOS       0.5μ CMOS        0.35μ CMOS
Frequency (MHz)      200              300              600
Total capacitance    12.5nF           -                -
Clock load           3.25nF           3.75nF           -
Clock power          40%              40% (20W)        -
Max. clock skew      200ps (<10%)     90ps             -

Alpha 21264: clock gating used. Total power 80-110W.

D. W. Bailey and B. J. Benschneider, “Clocking Design and Analysis for a 600-MHz Alpha Microprocessor,” IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1627-1633, Nov. 1998.


Power Reduction Example

Alpha 21064: 200MHz @ 3.45V, power dissipation = 26W
Reduce voltage to 1.5V, power (5.3x) = 4.9W
Eliminate FP, power (3x) = 1.6W
Scale 0.75→0.35μ, power (2x) = 0.8W
Reduce clock load, power (1.3x) = 0.6W
Reduce frequency 200→160MHz, power (1.25x) = 0.5W

J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, Nov. 1996.
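The chain of reductions can be checked by dividing the starting power by each factor in turn. The short sketch below does only that arithmetic; the step labels are paraphrased from the slide and the variable names are illustrative.

```python
steps = [
    ("Reduce voltage 3.45V -> 1.5V", 5.3),
    ("Eliminate FP", 3.0),
    ("Scale 0.75 -> 0.35 micron", 2.0),
    ("Reduce clock load", 1.3),
    ("Reduce frequency 200 -> 160 MHz", 1.25),
]

power = 26.0  # watts, Alpha 21064 at 200MHz and 3.45V
trace = [power]
for name, factor in steps:
    power /= factor
    trace.append(power)
    print(f"{name}: {power:.2f} W")

# Two of the factors follow directly from the dynamic power relation C*V^2*f.
voltage_factor = (3.45 / 1.5) ** 2    # quadratic dependence on voltage
frequency_factor = 200.0 / 160.0      # linear dependence on frequency
```

The voltage step alone accounts for most of the saving, which is consistent with the quadratic dependence of dynamic power on supply voltage.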


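Parallel Architecture

[Figure: a single processor clocked at f between input and output, compared with two parallel processors each clocked at f/2 whose results are combined at rate f.]

Single processor: Capacitance = C, Voltage = V, Frequency = f, Power = $CV^2 f$

Two parallel processors: Capacitance = 2.2C, Voltage = 0.6V, Frequency = 0.5f, Power = $0.396\,CV^2 f$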


Pipeline Architecture

[Figure: a single processor clocked at f with an output register, compared with a two-stage pipeline of half-processors separated by registers, also clocked at f.]

Single processor: Capacitance = C, Voltage = V, Frequency = f, Power = $CV^2 f$

Two-stage pipeline: Capacitance = 1.2C, Voltage = 0.6V, Frequency = f, Power = $0.432\,CV^2 f$
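Both the parallel and the pipeline power figures follow from scaling each term of CV²f. A minimal check, with the scaling factors taken from the two architecture slides and the function name chosen here for illustration:

```python
def relative_power(cap_factor, volt_factor, freq_factor):
    """Power relative to the baseline C * V^2 * f after scaling each term."""
    return cap_factor * volt_factor ** 2 * freq_factor


# Two parallel processors: 2.2C, 0.6V, 0.5f.
parallel = relative_power(2.2, 0.6, 0.5)

# Two-stage pipeline: 1.2C, 0.6V, f.
pipeline = relative_power(1.2, 0.6, 1.0)

print(f"parallel: {parallel:.3f}  pipeline: {pipeline:.3f}")
```

In both cases the saving comes from the lower supply voltage, which is affordable because each unit has twice as long (parallel) or half as much logic (pipeline) per clock period.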


Approximate Trend

                 n-parallel proc.     n-stage pipeline proc.
Capacitance      nC                   C
Voltage          V/n                  V/n
Frequency        f/n                  f
Power            $CV^2 f / n^2$       $CV^2 f / n^2$
Chip area        n times              10-20% increase

G. K. Yeap, Practical Low Power Digital VLSI Design, Boston: Kluwer Academic Publishers, 1998.


For More on Microprocessors

T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002.

R. Graybill and R. Melhem, Power Aware Computing, New York: Plenum Publishers, 2002.