
  • ELEC 5270/6270 Spring 2011: Low-Power Design of Electronic Circuits

    Power Aware Microprocessors

    Vishwani D. Agrawal, James J. Danaher Professor
    Dept. of Electrical and Computer Engineering
    Auburn University, Auburn, AL 36849

    [email protected]
    http://www.eng.auburn.edu/~vagrawal/COURSE/E6270_Spr11/course.html

    Copyright Agrawal, 2007. ELEC5270/6270 Spring 11, Lecture 14

  • SIA Roadmap for Processors (1999)

    Source: http://www.semichips.org
    Untrue predictions.

    | Year                     | 1999 | 2002 | 2005 | 2008 | 2011 | 2014 |
    |--------------------------|------|------|------|------|------|------|
    | Feature size (nm)        | 180  | 130  | 100  | 70   | 50   | 35   |
    | Logic transistors/cm^2   | 6.2M | 18M  | 39M  | 84M  | 180M | 390M |
    | Clock (GHz)              | 1.25 | 2.1  | 3.5  | 6.0  | 10.0 | 16.9 |
    | Chip size (mm^2)         | 340  | 430  | 520  | 620  | 750  | 900  |
    | Power supply (V)         | 1.8  | 1.5  | 1.2  | 0.9  | 0.6  | 0.5  |
    | High-perf. power (W)     | 90   | 130  | 160  | 170  | 175  | 183  |

  • Power Reduction in Processors

    - Hardware methods:
      - Voltage reduction for dynamic power
      - Dual-threshold devices for leakage reduction
      - Clock gating, frequency reduction
      - Sleep mode
    - Architecture:
      - Instruction set
      - Hardware organization
    - Software methods

  • Performance Criteria

    - Throughput: computations per unit time.
    - Performance is the inverse of time; increasing CPU time indicates lower performance.
    - Power: computations per watt.
    - Energy efficiency: performance/joule.

  • SPEC CPU2006 Benchmarks

    Standard Performance Evaluation Corporation (SPEC), http://www.spec.org

    - Twelve integer and 17 floating point programs, CINT2006 and CFP2006.
    - Each program's run time is normalized to obtain a SPEC ratio with respect to the run time on a Sun Ultra Enterprise 2 system with a 296 MHz UltraSPARC II processor.
    - It takes about 12 days to run all benchmarks on the reference system.
    - CINT2006 and CFP2006 metrics are the geometric means of SPEC ratios:
      - Peak metric: each program is individually optimized (aggressive compilation).
      - Base metric: common optimization for all programs.
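As a rough illustration of how a CINT2006- or CFP2006-style metric is formed, the sketch below computes SPEC ratios and their geometric mean. The program names and run times are made-up placeholders, not SPEC data, and the function names are hypothetical.

```python
import math


def spec_ratio(reference_time_s, measured_time_s):
    """SPEC ratio: reference-system run time divided by measured run time."""
    return reference_time_s / measured_time_s


def geometric_mean(values):
    """Geometric mean, computed through logarithms to avoid overflow."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical run times in seconds: (reference system, system under test).
run_times = {
    "prog_a": (9000.0, 450.0),
    "prog_b": (8000.0, 250.0),
    "prog_c": (12000.0, 960.0),
}

ratios = {name: spec_ratio(ref, t) for name, (ref, t) in run_times.items()}
overall = geometric_mean(ratios.values())
print(ratios)
print(f"geometric-mean metric = {overall:.2f}")
```

Because the geometric mean multiplies ratios, a program that runs twice as fast moves the metric by the same factor regardless of how long that program runs on the reference system.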

  • SPEC CINT2006 Results

    http://www.spec.org/cpu2006/results/cint2006.html

    - Dell Inc., PowerEdge R610
      - CPU: Intel Xeon X5670, 2.93 GHz
      - Number of chips 2, cores 12, threads/core 2
      - Performance metric: 36.6 base, 39.4 peak
    - Dell Inc., PowerEdge M905
      - CPU: AMD Opteron 8381 HE, 2.50 GHz
      - Number of chips 4, cores 16, threads/core 1
      - Performance metric: 15.8 base, 19.1 peak

  • SPEC CFP2006 Results

    http://www.spec.org/cpu2006/results/cfp2006.html

    - Dell Inc., PowerEdge R610
      - CPU: Intel Xeon X5670, 2.93 GHz
      - Number of chips 2, cores 12, threads/core 2
      - Performance metric: 42.5 base, 45.8 peak
    - Dell Inc., PowerEdge M905
      - CPU: AMD Opteron 8381 HE, 2.50 GHz
      - Number of chips 4, cores 16, threads/core 1
      - Performance metric: 17.4 base, 21.5 peak

  • Other Benchmarks

    - LINPACK is a numerically intensive floating-point program that solves a linear system (Ax = b); it is used for benchmarking supercomputers.
    - SPECpower_ssj2008 measures power and performance of a computer system.
    - The initial benchmark addresses the performance of server-side Java; additional workloads are planned.

    http://www.spec.org/benchmarks.html#power

  • Second Quarter 2010 SPECpower_ssj2008 Results

    http://www.spec.org/power_ssj2008/results/res2010q2/

    Apr 7, 2010: Hewlett-Packard ProLiant DL385 G7
    - CPU: AMD Opteron 6174, 2.2 GHz
    - Number of chips 2, cores 12, threads/core 2
    - Total memory 16 GB
    - ssj operations @ 100%: 888,819
    - Average power @ 100%: 271 W
    - Average power @ active idle: 101 W
    - Overall ssj operations per watt: 2,355

  • Second Quarter 2010 SPECpower_ssj2008 Results

    http://www.spec.org/power_ssj2008/results/res2010q2/

    May 19, 2010: Dell Inc., PowerEdge R610
    - CPU: Intel Xeon X5670, 2.93 GHz
    - Number of chips 2, cores 12, threads/core 2
    - Total memory 12 GB
    - ssj operations @ 100%: 914,076
    - Average power @ 100%: 244 W
    - Average power @ active idle: 62.3 W
    - Overall ssj operations per watt: 2,938

  • Energy SPEC Benchmarks

    Energy efficiency mode: Besides the execution time, energy efficiency of SPEC benchmark programs is also measured. Energy efficiency of a benchmark program is given by:

    $$\text{Energy efficiency} = \frac{1/(\text{Execution time})}{\text{joules consumed}}$$

    D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, 4th Edition, Morgan Kaufmann Publishers (Elsevier), 2009.

  • Energy Efficiency

    Efficiency averaged on n benchmark programs:

    $$\text{Efficiency} = \left( \prod_{i=1}^{n} \text{Efficiency}_i \right)^{1/n}$$

    where $\text{Efficiency}_i$ is the efficiency for program i.

    Relative efficiency:

    $$\text{Relative efficiency} = \frac{\text{Efficiency of a computer}}{\text{Efficiency of reference computer}}$$
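The two definitions can be combined in a few lines. The sketch below is only illustrative: the execution times and energies are invented numbers, and the function names are hypothetical.

```python
import math


def energy_efficiency(exec_time_s, energy_j):
    """Efficiency of one program: (1 / execution time) / joules consumed."""
    return (1.0 / exec_time_s) / energy_j


def mean_efficiency(efficiencies):
    """n-th root of the product of the n per-program efficiencies."""
    efficiencies = list(efficiencies)
    return math.prod(efficiencies) ** (1.0 / len(efficiencies))


def relative_efficiency(machine_runs, reference_runs):
    """Mean efficiency of a machine divided by that of the reference machine."""
    machine = mean_efficiency(energy_efficiency(t, e) for t, e in machine_runs)
    reference = mean_efficiency(energy_efficiency(t, e) for t, e in reference_runs)
    return machine / reference


# Hypothetical (execution time in s, energy in J) pairs for two programs.
machine_runs = [(10.0, 200.0), (20.0, 100.0)]
reference_runs = [(20.0, 400.0), (40.0, 200.0)]

rel = relative_efficiency(machine_runs, reference_runs)
print(f"relative efficiency = {rel:.2f}")
```

In this made-up case the machine is twice as fast and uses half the energy on each program, so its relative efficiency comes out as a factor of four.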

  • SPEC2000 Relative Energy Efficiency

    [Figure: SPEC2000 relative energy efficiency for three operating modes: always max. clock; laptop adaptive clock; min. power, min. clock.]

  • Voltage Scaling

    - Dynamic: Reduce voltage and frequency during idle or low activity periods.
    - Static: Clustered voltage scaling
      - Logic on non-critical paths given lower voltage.
      - 47% power reduction with 10% area increase reported.

    M. Igarashi et al., "Clustered Voltage Scaling Techniques for Low-Power Design," Proc. IEEE Symp. Low Power Design, 1997.

  • Processor Utilization

    Throughput = Operations / second

    [Figure: throughput versus time. Labels: compute-intensive processes; system idle; low throughput (background) processes; maximum throughput.]

  • Examples of Processes

    - Compute-intensive: spreadsheet, spelling check, video decoding, scientific computing.
    - Low throughput: data entry, screen updates, low bandwidth I/O data transfer.
    - Idle: no computation, no expected output.

  • Effects of Voltage Reduction

    - Voltage reduction increases delay, decreases throughput:
      - Slow reduction in throughput at first
      - Rapid reduction in throughput as $V_{DD}$ approaches $V_{th}$
      - Time per operation (TPO) increases
    - Voltage reduction continues to reduce power consumption:
      - Energy per operation (EPO) = Power × TPO

  • Energy per Operation (EPO)

    [Figure: Power, TPO and EPO plotted against $V_{DD}/V_{th}$ from 1 to 5; vertical axis from 0.0 to 1.0.]
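The general shape of such curves can be sketched with a first-order model. The model here is an assumption borrowed from the worked example later in this lecture (maximum clock frequency proportional to $(V_{DD} - V_{th})/V_{DD}$, dynamic power proportional to $V_{DD}^2 f$); leakage is ignored and every quantity is normalized to its value at $V_{DD}/V_{th} = 5$. It is not necessarily the model behind the original plot.

```python
def normalized_curves(x, x_ref=5.0):
    """Return (power, tpo, epo) at x = VDD/Vth, each normalized to x = x_ref.

    Assumed model (not taken from the original figure):
      f     ~ (x - 1) / x     maximum clock frequency
      power ~ x**2 * f        dynamic power only, leakage ignored
      TPO   = 1 / f           time per operation
      EPO   = power * TPO     energy per operation
    """
    if x <= 1.0:
        raise ValueError("model is only meaningful for VDD > Vth")

    def raw(v):
        f = (v - 1.0) / v
        power = v * v * f
        tpo = 1.0 / f
        return power, tpo, power * tpo

    p, t, e = raw(x)
    p0, t0, e0 = raw(x_ref)
    return p / p0, t / t0, e / e0


for x in (1.5, 2.0, 3.0, 4.0, 5.0):
    power, tpo, epo = normalized_curves(x)
    print(f"VDD/Vth={x:.1f}  power={power:.3f}  TPO={tpo:.3f}  EPO={epo:.3f}")
```

Under these assumptions, power and EPO keep falling as the voltage drops, while TPO rises sharply only when the supply gets close to the threshold voltage.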

  • Dynamic Voltage and Clock

    | Throughput           | Time in fast mode | Time in slow mode | Time in idle mode | Battery life |
    |----------------------|-------------------|-------------------|-------------------|--------------|
    | Always full speed    | 10%               | 0%                | 90%               | 1 hr         |
    | Sometimes full speed | 1%                | 90%               | 9%                | 5.3 hrs      |
    | Rarely full speed    | 0.1%              | 99%               | 0.9%              | 9.2 hrs      |

    T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002, pp. 35-36.

  • Example: Find Minimum Energy Mode

    - Processor data (rated operation):
      - 2 GHz clock
      - 1.5 volt supply voltage
      - 0.5 volt threshold voltage
    - Power consumption:
      - 50 watts dynamic power
      - 50 watts static power
    - Maximum clock frequency for V volt supply: $f \propto (V - V_{TH})/V$

  • Example Cont.

    - Dynamic power:
      - $P_d = C V^2 f = C\,(1.5)^2 \times 2 \times 10^9 = 50$ W
      - $C = 11.11$ nF, capacitance switching/cycle
      - $P_d = 11.11\,V^2 f$ (in W, with f in GHz)
    - Dynamic energy per cycle:
      - $E_d = P_d / f = 11.11\,V^2$ (in nJ)

  • Example Cont.

    - Clock frequency:
      - $f = k\,(V - V_{TH})/V = k\,(1.5 - 0.5)/1.5 = 2$ GHz
      - $k = 3$ GHz, a proportionality constant
      - $f = 3\,(V - 0.5)/V$ GHz

  • Example Cont.

    - Static power:
      - $P_s = k V^2 = k\,(1.5)^2 = 50$ W
      - $k = 22.22$ mho, total leakage conductance (here k is not the frequency constant of the clock-frequency step)
      - $P_s = 22.22\,V^2$
    - Static energy per cycle:
      - $E_s = P_s / f = 22.22\,V^3 / [3\,(V - 0.5)] = 7.41\,V^3 / (V - 0.5)$ (in nJ)

  • Example Cont.

    - Total energy per cycle:
      - $E = E_d + E_s = 11.11\,V^2 + 7.41\,V^3 / (V - 0.5)$
    - To minimize E, set $\partial E / \partial V = 0$, or
      - $5V^2 - 4.5V + 0.75 = 0$
    - Solutions of the quadratic equation:
      - V = 0.679 volt, 0.221 volt
      - Discard the second solution, which is lower than the threshold voltage of 0.5 volt.
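The intermediate steps behind the quadratic are not shown on the slide. One way to fill them in, using the same constants (and noting that 22.22/7.41 is approximately 3), is sketched below; the two roots it yields are the ones quoted above.

```latex
\begin{align*}
E(V) &= 11.11\,V^2 + \frac{7.41\,V^3}{V - 0.5} \\
\frac{\partial E}{\partial V}
  &= 22.22\,V + 7.41\,\frac{3V^2 (V - 0.5) - V^3}{(V - 0.5)^2}
   = 22.22\,V + 7.41\,\frac{V^2 (2V - 1.5)}{(V - 0.5)^2} = 0 \\
\text{Multiply by } \frac{(V - 0.5)^2}{7.41\,V}: \quad
  & 3\,(V - 0.5)^2 + V\,(2V - 1.5) = 0 \\
  & 5V^2 - 4.5V + 0.75 = 0 \\
V &= \frac{4.5 \pm \sqrt{4.5^2 - 4 \cdot 5 \cdot 0.75}}{2 \cdot 5}
   = \frac{4.5 \pm 2.291}{10} \approx 0.679,\ 0.221
\end{align*}
```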

  • Example: Result

    |                      | Rated mode | Low energy mode | Reduction (%) |
    |----------------------|------------|-----------------|---------------|
    | Voltage              | 1.5 V      | 0.679 V         | 54.7%         |
    | Clock frequency      | 2 GHz      | 791 MHz         | 60%           |
    | Dynamic energy/cycle | 25.00 nJ   | 5.12 nJ         | 79.52%        |
    | Static energy/cycle  | 25.00 nJ   | 12.96 nJ        | 48.16%        |
    | Total energy/cycle   | 50.0 nJ    | 18.08 nJ        | 63.84%        |
    | Dynamic power        | 50.0 W     | 4.05 W          | 91.90%        |
    | Static power         | 50.0 W     | 10.25 W         | 79.50%        |
    | Total power          | 100.0 W    | 14.30 W         | 85.70%        |
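The whole example can be recomputed in a short script, which is a convenient way to check the table entries. This is a minimal sketch under the example's own model; the constant names `C`, `K`, `G` and the helper functions are hypothetical, and the closed-form root comes from setting the derivative of the total energy to zero for general constants.

```python
import math

# Rated operating point from the example.
V_RATED, V_TH = 1.5, 0.5            # volts
F_RATED = 2.0                       # GHz
P_DYN_RATED = P_STAT_RATED = 50.0   # watts

# Model constants fitted to the rated point.
C = P_DYN_RATED / (V_RATED**2 * F_RATED)    # nF:  Pd = C V^2 f  (f in GHz)
K = F_RATED * V_RATED / (V_RATED - V_TH)    # GHz: f = K (V - Vth) / V
G = P_STAT_RATED / V_RATED**2               # mho: Ps = G V^2


def frequency_ghz(v):
    return K * (v - V_TH) / v


def energy_per_cycle_nj(v):
    """Return (dynamic, static) energy per cycle in nJ at supply voltage v."""
    return C * v**2, G * v**2 / frequency_ghz(v)


def minimum_energy_voltage():
    """Solve dE/dV = 0, i.e. a*V^2 + b*V + c = 0, keeping the root above Vth."""
    r = G / K
    a = 2.0 * (C + r)
    b = -(4.0 * C + 3.0 * r) * V_TH
    c = 2.0 * C * V_TH**2
    return (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)


def report(v):
    f = frequency_ghz(v)
    e_dyn, e_stat = energy_per_cycle_nj(v)
    return {
        "voltage_v": v,
        "clock_ghz": f,
        "dynamic_energy_nj": e_dyn,
        "static_energy_nj": e_stat,
        "total_energy_nj": e_dyn + e_stat,
        "dynamic_power_w": e_dyn * f,
        "static_power_w": e_stat * f,
        "total_power_w": (e_dyn + e_stat) * f,
    }


rated = report(V_RATED)
low = report(minimum_energy_voltage())
for key in rated:
    reduction = 100.0 * (1.0 - low[key] / rated[key])
    print(f"{key:18s} {rated[key]:8.3f} {low[key]:8.3f} {reduction:6.2f}%")
```

Each printed row gives the rated value, the low-energy value and the percentage reduction, in the same order as the rows of the table.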

  • Problem of Process Variation in Nanometer Technologies

    [Figure: number of chips versus $V_{th}$ (lower $V_{th}$ to higher $V_{th}$), with power specification and clock specification limits. Labels: yield loss due to high leakage; yield loss due to slow speed; higher voltage operation; lower voltage operation; nominal voltage.]

    From a presentation: "Power Reduction using LongRun2 in Transmeta's Efficeon Processor," by D. Ditzel, May 17, 2006.

  • Pipeline Gating

    - A pipeline processor uses speculative execution.
    - Incorrect branch prediction results in pipeline stalls and wasted energy.
    - Idea: Stop fetching instructions if a branch hazard is expected:
      - If the count (M) of incorrect predictions exceeds a pre-specified number (N), then suspend instruction fetch for some k cycles.

    Ref.: S. Manne, A. Klauser and D. Grunwald, "Pipeline Gating: Speculation Control for Energy Reduction," Proc. 25th Annual International Symp. Computer Architecture, June 1998.
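A minimal sketch of the gating rule as this slide states it: count incorrect predictions (M) and, once M exceeds the threshold N, suspend fetch for k cycles. The class name, the reset of M after gating, and the per-cycle interface are illustrative assumptions, not details taken from Manne et al.

```python
class PipelineGate:
    """Toy fetch-gating controller (hypothetical interface)."""

    def __init__(self, threshold_n, stall_cycles_k):
        self.threshold_n = threshold_n          # N
        self.stall_cycles_k = stall_cycles_k    # k
        self.mispredict_count = 0               # M
        self.stall_remaining = 0

    def record_branch(self, predicted_taken, actually_taken):
        """Update M with one resolved branch; arm the gate when M exceeds N."""
        if predicted_taken != actually_taken:
            self.mispredict_count += 1
        if self.mispredict_count > self.threshold_n:
            self.stall_remaining = self.stall_cycles_k
            self.mispredict_count = 0   # assumption: restart the count after gating

    def fetch_allowed(self):
        """Call once per cycle; returns False while fetch is gated."""
        if self.stall_remaining > 0:
            self.stall_remaining -= 1
            return False
        return True


gate = PipelineGate(threshold_n=2, stall_cycles_k=3)
# (predicted, actual) outcomes; three of the four are mispredictions.
outcomes = [(True, False), (True, True), (False, True), (True, False)]
for predicted, actual in outcomes:
    gate.record_branch(predicted, actual)

trace = [gate.fetch_allowed() for _ in range(5)]
print(trace)
```

The third misprediction pushes M past N = 2, so the next three cycles are gated before fetch resumes.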

  • Slack Scheduling

    - Application: Superscalar, out-of-order execution:
      - An instruction is executed as soon as the required data and resources become available.
      - A commit unit reorders the results.
    - Delay the completion of instructions whose result is not immediately needed.
    - Example of RISC instructions:

        add r0, r1, r2;    (A)
        sub r3, r4, r5;    (B)
        and r9, r1, r9;    (C)
        or  r5, r9, r10;   (D)
        xor r2, r10, r11;  (E)

    J. Casmira and D. Grunwald, "Dynamic Instruction Scheduling Slack," Proc. ACM Kool Chips Workshop, Dec. 2000.
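To see which of the five instructions has slack, one can compare earliest and latest start times in the dependence graph. The sketch below makes several assumptions that are not stated on the slide: the last operand is the destination register, every instruction takes one cycle, there are unlimited execution units, and only true (read-after-write) dependences matter.

```python
# Each entry: (label, source registers, destination register).
# Assumption: the last operand is the destination (add r0, r1, r2 writes r2).
program = [
    ("A", ("r0", "r1"), "r2"),
    ("B", ("r3", "r4"), "r5"),
    ("C", ("r9", "r1"), "r9"),
    ("D", ("r5", "r9"), "r10"),
    ("E", ("r2", "r10"), "r11"),
]


def true_dependences(instrs):
    """Map each instruction to the earlier instructions whose results it reads."""
    last_writer = {}
    deps = {}
    for label, sources, dest in instrs:
        deps[label] = {last_writer[r] for r in sources if r in last_writer}
        last_writer[dest] = label
    return deps


def slack(instrs, latency=1):
    """Slack per instruction = latest start cycle minus earliest start cycle."""
    deps = true_dependences(instrs)
    order = [label for label, _, _ in instrs]

    # Earliest start: as soon as every producer has finished.
    earliest = {}
    for label in order:
        earliest[label] = max((earliest[d] + latency for d in deps[label]), default=0)
    last_cycle = max(earliest.values())

    # Latest start: just in time for the first consumer, without
    # lengthening the overall schedule.
    latest = {}
    for label in reversed(order):
        consumers = [c for c in order if label in deps[c]]
        latest[label] = min((latest[c] - latency for c in consumers), default=last_cycle)

    return {label: latest[label] - earliest[label] for label in order}


result = slack(program)
print(result)
```

Under these assumptions only A has slack: its result is first read by E, which must wait for D anyway, so A can finish one cycle later without delaying the sequence.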

  • Slack Scheduling Example

    [Figure: timing of instructions A, B, C, D, E under standard scheduling and under slack scheduling.]

  • Slack Scheduling

    [Figure: block diagram. Labels: slack bit; low-power execution units; re-order buffer; scheduling logic.]

  • Clock Distribution H-Tree

    [Figure: H-tree distributing the clock to the flip-flops.]

    Fanout, $\beta = 4$
    Tree depth, $s = \log_{\beta} N$
    No. of flip-flops = N

  • Clock Power

    $$P_{clk} = \alpha C_L V_{DD}^2 f + \frac{\alpha C_L V_{DD}^2 f}{\beta} + \frac{\alpha C_L V_{DD}^2 f}{\beta^2} + \dots = \alpha C_L V_{DD}^2 f \sum_{n=0}^{\text{stages}-1} \frac{1}{\beta^n}$$

    where $C_L$ = total load capacitance of N flip-flops
          $\beta$ = constant fanout at each stage in distribution network

    - Clock consumes about 40% of total processor power, because
      - Clock is always active
      - Makes two transitions per cycle ($\alpha = 2$)
    - Clock gating is useful; inhibit clock to unused blocks
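The sum over stages is a finite geometric series, so it has a closed form. In the sketch below, $\alpha$ is the number of clock transitions per cycle, $\beta$ the fanout per stage and $s$ the number of stages; these symbol names are a reading of the slide and should be treated as assumptions where the original notation is unclear.

```latex
\begin{align*}
P_{clk} &= \alpha C_L V_{DD}^2 f \sum_{n=0}^{s-1} \frac{1}{\beta^{n}}
         = \alpha C_L V_{DD}^2 f \,\frac{1 - \beta^{-s}}{1 - \beta^{-1}} \\
\lim_{s \to \infty} P_{clk} &= \alpha C_L V_{DD}^2 f \,\frac{\beta}{\beta - 1}
\end{align*}
```

For $\beta = 4$ the limit is $4/3$ of the power delivered to the flip-flop load, so under this model the buffers of the distribution tree add at most about a third on top of the load itself.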

  • Properties of H-Tree

    - Balanced clock skew.
    - Small delay and power consumption.
    - Requires fine-tuning for complex layout.

  • Clock Power and Delay

    - Unit size buffer or inverter delay = d
    - Total dynamic power supplied to N flip-flops, $P = \alpha C_L V_{DD}^2 f$
    - Total power consumption of clock network:

    | Flip-flops, N | Clock power per flip-flop | Clock delay |
    |---------------|---------------------------|-------------|
    | 1             | P                         | d           |
    | 4             | P                         | 4d          |
    | 16            | 1.25P                     | 8d          |
    | 64            | 1.3125P                   | 12d         |
    | 256           | 1.328125P                 | 16d         |
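The rows for a fanout-of-4 tree can be regenerated from the series used for clock power. This is a minimal sketch with hypothetical function names; it assumes each stage driving four loads contributes a delay of 4d, and it leaves out the single flip-flop row, which has no distribution stages.

```python
def clock_power_factor(stages, fanout=4):
    """Sum of 1/fanout**n for n = 0 .. stages-1: clock network power in units of P."""
    return sum(1.0 / fanout**n for n in range(stages))


def clock_delay_units(stages, fanout=4):
    """Delay in units of d, assuming each stage with `fanout` loads costs fanout * d."""
    return stages * fanout


rows = []
for stages in range(1, 5):
    flip_flops = 4**stages
    rows.append((flip_flops, clock_power_factor(stages), clock_delay_units(stages)))

for flip_flops, power, delay in rows:
    print(f"N={flip_flops:4d}  power={power:.6f} P  delay={delay:2d} d")
```

Each extra level of the tree multiplies the number of flip-flops by four but adds only a quarter of the previous increment to the power factor, which is why the factor levels off near 4/3.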

  • Clock Network Examples

    D. W. Bailey and B. J. Benschneider, "Clocking Design and Analysis for a 600-MHz Alpha Microprocessor," IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1627-1633, Nov. 1998.

    |                   | Alpha 21064  | Alpha 21164 | Alpha 21264                              |
    |-------------------|--------------|-------------|------------------------------------------|
    | Technology        | 0.75 μm CMOS | 0.5 μm CMOS | 0.35 μm CMOS                             |
    | Frequency (MHz)   | 200          | 300         | 600                                      |
    | Total capacitance | 12.5 nF      |             | Clock gating used. Total power 80-110 W  |
    | Clock load        | 3.25 nF      | 3.75 nF     |                                          |
    | Clock power       | 40%          | 40% (20 W)  |                                          |
    | Max. clock skew   | 200 ps (     |             |                                          |

  • Power Reduction Example

    - Alpha 21064: 200 MHz @ 3.45 V, power dissipation = 26 W
    - Reduce voltage to 1.5 V: power reduced 5.3x, to 4.9 W
    - Eliminate FP: power reduced 3x, to 1.6 W
    - Scale 0.75 μm → 0.35 μm: power reduced 2x, to 0.8 W
    - Reduce clock load: power reduced 1.3x, to 0.6 W
    - Reduce frequency 200 → 160 MHz: power reduced 1.25x, to 0.5 W

    J. Montanaro et al., "A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor," IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, Nov. 1996.
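The chain of reductions is simple division, and two of the factors can be checked from first principles: dynamic power scales with the square of the supply voltage and linearly with frequency. The sketch below applies the slide's factors in order; the variable names are illustrative only.

```python
# (step, reduction factor) pairs as listed on the slide.
steps = [
    ("Reduce voltage 3.45 V -> 1.5 V", 5.3),
    ("Eliminate FP", 3.0),
    ("Scale 0.75 um -> 0.35 um", 2.0),
    ("Reduce clock load", 1.3),
    ("Reduce frequency 200 -> 160 MHz", 1.25),
]

power_w = 26.0   # Alpha 21064 at 200 MHz, 3.45 V
history = []
for name, factor in steps:
    power_w /= factor
    history.append((name, power_w))
    print(f"{name:34s} /{factor:<5} -> {power_w:.2f} W")

# Two factors follow directly from P ~ V^2 * f.
voltage_factor = (3.45 / 1.5) ** 2
frequency_factor = 200.0 / 160.0
print(f"voltage factor = {voltage_factor:.2f}, frequency factor = {frequency_factor:.2f}")
```

The remaining factors (removing the floating-point unit, process scaling and clock load) are design-specific and are simply taken from the slide.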

  • For More on Microprocessors

    - T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002.
    - R. Graybill and R. Melhem, Power Aware Computing, New York: Plenum Publishers, 2002.