Low power Architecture - Rochester Institute of...
Low-power Architecture
By: Jonathan Herbst, Scott Duntley
Why low power?
• Has become necessary with new-age demands:
  o Increasing design complexity
  o Demands of and for portable equipment: communication, media, mobile computers
  o Most embedded systems run on batteries
    Objective: extend battery life as long as possible without sacrificing too much performance
  o Lower running costs $$
• Go green!
Low power architecture
• Memory techniques
  o Associativity
  o Low-power refresh
  o Drowsy cache
• Bus techniques
  o Bus inversion
• ISA
• Branch prediction
• Parallel Processing vs. Superpipelining
• Clock gating/scaling
• Voltage Scaling
• Cortex A8
Memory - Associativity
• Direct-mapped cache - Least power -> no block searching
• Conventional set associative
  o As block read occurs -> Both tag and data arrays are read
  o Data written to bus -> Only used if tags match
  o As associativity increases, power consumption increases
• Alternative: Phased set associative
  o Tag and data are broken into sub-arrays
  o Only the tag array is read and compared
  o Data sub-array is read/written to a buffer upon a cache hit, and then to the bus
  o Advantage: Less power consumption by avoiding unnecessary data reads
  o Disadvantage: Takes 2 clock cycles rather than one
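The difference between the two lookup styles can be sketched by counting array reads per access. This is a minimal model, not a description of any real cache: the names `AccessCost`, `conventional_lookup`, and `phased_lookup` are hypothetical, and the one-cycle cost charged for a phased miss is an assumption.

```python
from dataclasses import dataclass


@dataclass
class AccessCost:
    tag_reads: int    # tag sub-arrays read
    data_reads: int   # data sub-arrays read
    cycles: int       # cycles until hit data (or the miss) is known


def conventional_lookup(set_tags, tag):
    """Read all tags and all data ways in parallel; use the data only on a match."""
    ways = len(set_tags)
    hit = tag in set_tags
    return hit, AccessCost(tag_reads=ways, data_reads=ways, cycles=1)


def phased_lookup(set_tags, tag):
    """Phase 1: read and compare tags only. Phase 2: read one data way on a hit."""
    ways = len(set_tags)
    hit = tag in set_tags
    # Assumption: a miss is known after the tag phase, so it costs one cycle here.
    return hit, AccessCost(tag_reads=ways,
                           data_reads=1 if hit else 0,
                           cycles=2 if hit else 1)


if __name__ == "__main__":
    one_set = [0x1A, 0x2B, 0x3C, 0x4D]  # tags held in one 4-way set
    print(conventional_lookup(one_set, 0x2B))
    print(phased_lookup(one_set, 0x2B))
```

In this model a 4-way hit costs four data-array reads conventionally and one in the phased design, at the price of the second cycle.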
Memory - Phased set associative
[Figure: Phased Set Associative Cache]
Memory - Associativity - Benchmark
Cache Type                     Miss Rate   Average Power Increase from Direct-Mapped
Direct-Mapped                  .046        -
4-way Set Associative          .035        85.6%
4-way Phased Set Associative   .035        68.5%
Cache power analysis
Power Management
Static
• Power Domains
• Voltage Domains
Dynamic
• Clock Scaling/Gating
• Voltage Scaling
• Wait-For-Interrupt
Memory - Drowsy cache
• Modern processors -> Growing cache size
  o Contributes a sizable fraction of a chip's power consumption
  o As transistor sizes decrease -> large amount of power lost to leakage
• Idea: Put cold cache lines into a state-preserving low-power state to reduce leakage current
  o Low-power state = 25% of full-power energy
• Disadvantage: Slight performance loss due to the "wake-up" time required to access a drowsy line
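A rough way to see the trade-off is to simulate a simple drowsy policy. The sketch below assumes that every `window` cycles all lines are put into the drowsy state and that waking a line costs one extra cycle. Both the policy and the penalty are assumptions for illustration; only the 25% figure comes from the slide. The function name `simulate_drowsy` is hypothetical.

```python
DROWSY_FRACTION = 0.25   # from the slide: low-power state = 25% of full-power energy
WAKE_PENALTY_CYCLES = 1  # assumption: one extra cycle to wake a drowsy line


def simulate_drowsy(accesses, num_lines, window):
    """Simulate one cache access per cycle.

    accesses: list of line indices, one per cycle.
    Every `window` cycles all lines are put into the drowsy state.
    Returns (leakage energy relative to an always-awake cache, wake-up cycles).
    """
    if not accesses:
        return 1.0, 0
    awake = [True] * num_lines
    energy = 0.0
    wake_cycles = 0
    for cycle, line in enumerate(accesses):
        if cycle > 0 and cycle % window == 0:
            awake = [False] * num_lines      # periodic "everything goes drowsy"
        if not awake[line]:
            awake[line] = True               # wake the line before using it
            wake_cycles += WAKE_PENALTY_CYCLES
        awake_count = sum(awake)
        energy += awake_count + DROWSY_FRACTION * (num_lines - awake_count)
    baseline = float(num_lines * len(accesses))
    return energy / baseline, wake_cycles


if __name__ == "__main__":
    ratio, penalty = simulate_drowsy([0] * 8, num_lines=4, window=4)
    print(f"relative leakage energy: {ratio:.3f}, wake-up cycles: {penalty}")
```

The more lines that stay cold between windows, the closer the relative energy gets to the 25% floor, while the wake-up count shows the performance cost.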
Drowsy cache - Benchmark
[Figure: Drowsy cache benchmark]
Buses - Bus inversion
Dynamic power: $P = \alpha f C V^2$
Where,
• $\alpha$ (alpha) = switching factor
• f = clock frequency
• C = capacitance
• V = voltage
Want to:
• Minimize switching factor
• Bus lines are normally of high capacitance
  o Large amount of power consumption due to switching
Idea:
• If more than N/2 of the bits on an N-bit bus need to switch
  o Invert the entire word before driving the bus, so fewer than N/2 lines switch; an extra invert line tells the receiver to undo it
  o Advantage: Less power consumed
  o Disadvantage: More hardware needed
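The bus inversion idea can be sketched as a small encoder/decoder pair. This is an illustrative model, assuming the bus starts at all zeros and that one extra invert line accompanies the data lines. The function names are hypothetical.

```python
def bus_invert_encode(words, width):
    """Return a list of (bus_value, invert_flag) pairs for an N-bit bus."""
    mask = (1 << width) - 1
    prev = 0  # assumption: all bus lines start at 0
    encoded = []
    for word in words:
        word &= mask
        toggles = bin(prev ^ word).count("1")
        if 2 * toggles > width:            # more than N/2 lines would switch
            bus_value, invert = word ^ mask, 1
        else:
            bus_value, invert = word, 0
        encoded.append((bus_value, invert))
        prev = bus_value                   # the bus now holds what was driven
    return encoded


def bus_invert_decode(encoded, width):
    """Undo the inversion on the receiving side."""
    mask = (1 << width) - 1
    return [value ^ mask if invert else value for value, invert in encoded]


def count_toggles(values):
    """Count line transitions for a sequence of values, starting from 0."""
    prev, total = 0, 0
    for value in values:
        total += bin(prev ^ value).count("1")
        prev = value
    return total


if __name__ == "__main__":
    words = [0x00, 0xFF, 0x00, 0xFF]
    enc = bus_invert_encode(words, 8)
    print("plain toggles:  ", count_toggles(words))
    print("data toggles:   ", count_toggles([v for v, _ in enc]))
    print("invert toggles: ", count_toggles([i for _, i in enc]))
```

For this worst-case alternating pattern, the data lines stop toggling entirely and only the single invert line switches, which is where the extra hardware cost shows up.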
[Figure: Bus Inversion]
Parallel Processing and Pipelining
Parallel Computations
• Multiple cores
• Multiple-issue pipelines
• Linear power increase
Pipelining
• Faster clock
• Exponential power increase
• Longer branch misprediction penalties
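The contrast can be made concrete with the first-order dynamic power model (switching factor times frequency times capacitance times voltage squared). The sketch below is a simplified comparison under two stated assumptions: adding cores replicates power with no overhead, and a faster clock needs a proportionally higher supply voltage. Under that second assumption power grows with the cube of the speedup. All names and numbers are hypothetical.

```python
def dynamic_power(alpha, f, c, v):
    """First-order dynamic power: switching factor * frequency * capacitance * V^2."""
    return alpha * f * c * v ** 2


def power_parallel(base_power, n_cores):
    """Assumption: n identical cores at the same f and V, no sharing overhead."""
    return n_cores * base_power


def power_faster_clock(alpha, f, c, v, speedup):
    """Assumption: supply voltage must rise in proportion to clock frequency."""
    return dynamic_power(alpha, f * speedup, c, v * speedup)


if __name__ == "__main__":
    alpha, f, c, v = 0.2, 600e6, 1e-9, 1.0   # illustrative values only
    base = dynamic_power(alpha, f, c, v)
    print("2 cores:  x", power_parallel(base, 2) / base)
    print("2x clock: x", power_faster_clock(alpha, f, c, v, 2) / base)
```

Doubling throughput with a second core doubles power in this model, while doubling the clock costs roughly eight times as much.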
Low power & ISA
• Single Instruction, Multiple Data (SIMD)
  o Reduce number of instruction fetches/decodes -> Reduce power
• RISC vs. CISC
  o ASP (application-specific processor) Embedded - CISC: More specific hardware helps reduce overhead from general hardware -> less power
  o General Embedded - RISC: Less specific operations needed; reduced complexity helps with power consumption
  o The line is blurring - less and less need for ASPs since GPPs (general-purpose processors) are rapidly becoming more powerful and lower-power
Branch prediction techniques
• Accurately predict branches without too much complexity
  o Static branch prediction: simple; done at compile time via the ISA; based on examination of program behavior; predict backward branches taken, forward branches not taken
  o Dynamic branch prediction: more complex, more hardware; occurs at run time; higher power consumption but much more accurate; uses a Branch Target Buffer (BTB) and Pattern History Table (PHT)
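The two approaches can be compared with a short sketch: a static "backward taken, forward not taken" rule, and a small dynamic table of 2-bit saturating counters. The table size, the PC indexing (which assumes 4-byte instructions), and the initial counter value are assumptions. The names `static_btfn`, `TwoBitPredictor`, and `accuracy` are hypothetical.

```python
def static_btfn(branch_pc, target_pc):
    """Static rule: predict taken only if the branch jumps backward (a loop)."""
    return target_pc < branch_pc


class TwoBitPredictor:
    """Dynamic predictor: a table of 2-bit saturating counters indexed by PC."""

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [1] * entries   # assumption: start at "weakly not taken"

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumption: 4-byte aligned instructions

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, trace):
    """trace: list of (pc, taken) pairs; predict first, then train."""
    correct = 0
    for pc, taken in trace:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(trace)


if __name__ == "__main__":
    loop = [(0x40, True)] * 9 + [(0x40, False)]   # loop branch: 9 taken, 1 exit
    print("dynamic accuracy:", accuracy(TwoBitPredictor(), loop))
    print("static prediction for backward branch:", static_btfn(0x40, 0x20))
```

The static rule needs no storage at all, while the dynamic table spends hardware (and power) to learn each branch's behavior.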
Cortex A8 Die
Cortex A8 Architecture
Architecture Overview
• < 300 mW to 1 W Power Consumption
• 600 MHz at 1.08 V, 1 GHz at 0.9 V Configuration (up to 1.5 GHz, but suffers a significant power increase)
• 13 cycle, 2 issue superscalar pipeline
• Static scheduling scoreboard
• Integrated NEON multimedia pipeline
• Static and dynamic power management
Static Scheduling Scoreboard
• Static instruction scheduling
• In-order issue, in-order retire
• Dynamic voltage and clock scaling
Pending Queue:
• Takes better advantage of 2-issue pipeline
Replay Queue:
• Holds issue information only
• Avoids long cache miss stalls
Instruction Set Architecture
• RISC Architecture
• 2-issue instructions
• Multicycle instructions
• SIMD Instructions for NEON
• Shift included instructions
• 32-bit instructions compressed to 16-bit for a 30% code reduction
Branch Prediction
• 95% accuracy
• 10-bit Global History Register (GHR)
• 4096-entry (256x16) Global History Buffer (GHB) with 2-bit saturating counters
  o column indexed by first 8 bits of GHR
  o row indexed by last two bits of GHR XORed with low 4 bits of PC
• 512-entry Branch Target Buffer (BTB)
  o indexed by address
  o stores branch address and branch type
• 1 stall cycle on branch taken
• 13-cycle penalty on misprediction
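One possible reading of the GHR/GHB indexing above is sketched below. It is not taken from ARM documentation. It assumes "first 8 bits" means the upper 8 bits of the 10-bit GHR, that the PC bits are taken above the two alignment bits of a 32-bit instruction, and that counters start at "weakly not taken". The class name `GlobalHistoryPredictor` is hypothetical.

```python
class GlobalHistoryPredictor:
    GHR_BITS = 10
    COLUMNS = 256   # selected by 8 GHR bits
    ROWS = 16       # selected by 2 GHR bits XORed with 4 PC bits

    def __init__(self):
        self.ghr = 0
        # 256 x 16 = 4096 two-bit counters; assumption: start at 1 (weakly not taken)
        self.ghb = [[1] * self.ROWS for _ in range(self.COLUMNS)]

    def _index(self, pc):
        column = (self.ghr >> 2) & 0xFF              # assumption: upper 8 GHR bits
        row = ((self.ghr & 0x3) ^ (pc >> 2)) & 0xF   # assumption: skip alignment bits
        return column, row

    def predict(self, pc):
        column, row = self._index(pc)
        return self.ghb[column][row] >= 2

    def update(self, pc, taken):
        column, row = self._index(pc)
        counter = self.ghb[column][row]
        self.ghb[column][row] = min(3, counter + 1) if taken else max(0, counter - 1)
        # Shift the outcome into the global history, keeping only 10 bits.
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.GHR_BITS) - 1)


if __name__ == "__main__":
    predictor = GlobalHistoryPredictor()
    for _ in range(20):
        predictor.update(0x100, True)   # train on an always-taken branch
    print("predict taken:", predictor.predict(0x100))
```

Because the index mixes recent branch outcomes with the PC, the same branch can map to different counters depending on the path that led to it.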
Memory
L1 Cache
• 32 or 64 KB
• Separate instruction and data cache
• 1 cycle latency
• 4-way set associative
• Hash Virtual Address Buffer
Data Cache
• 3-entry 64-bit integer store buffer
• 8-entry 128-bit NEON store buffer
L2 Cache
• Up to 1 MB
• 8 cycle latency
• 8-way set associative
• Nonblocking NEON loads
Static Power Management
Power Domains
Static Power Management
Voltage Domains
Dynamic Power Management
• Wait-For-Interrupt Architecture
• Clock gating
• Voltage scaling
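How these three mechanisms fit together can be sketched as a toy governor. It picks the slowest operating point that meets demand, and treats zero demand as the wait-for-interrupt case with the clock gated. The operating points below are hypothetical values, not Cortex A8 figures, and the model ignores static (leakage) power.

```python
# Hypothetical (frequency in MHz, voltage in V) pairs, slowest first.
OPERATING_POINTS = [(250, 0.90), (500, 1.00), (750, 1.10), (1000, 1.20)]


def pick_operating_point(required_mhz):
    """Return the lowest point that covers the demand, or None when idle."""
    if required_mhz <= 0:
        return None   # nothing to run: wait for interrupt, clock gated
    for freq, volt in OPERATING_POINTS:
        if freq >= required_mhz:
            return (freq, volt)
    return OPERATING_POINTS[-1]   # demand exceeds the fastest point


def relative_dynamic_power(point):
    """Dynamic power relative to the fastest point, using P proportional to f * V^2."""
    if point is None:
        return 0.0    # gated clock: no switching activity in this model
    freq, volt = point
    fmax, vmax = OPERATING_POINTS[-1]
    return (freq * volt ** 2) / (fmax * vmax ** 2)


if __name__ == "__main__":
    for demand in (0, 200, 400, 900):
        point = pick_operating_point(demand)
        print(demand, "MHz ->", point, f"{relative_dynamic_power(point):.2f}")
```

Because voltage enters squared, dropping to a lower operating point saves more power than the frequency reduction alone would suggest.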
Future of ARM?
• ARM chips currently offered at $10-20 apiece
  o Intel Atom -> $35+
• ARM currently controls about 90% of the mobile phone processor market -> Low Price/Power
  o Intel still needs more R&D to be able to compete with ARM power specs
• Why not for laptops/netbooks?
  o Regular Windows cannot run on it (Linux/Android and Windows Mobile/CE (Embedded Compact) can)
  o Excludes main part of consumer PC market
  o If a mainstream Windows release supports ARM -> ARM could easily move into the market
• Increasing parallelism
• Increased performance-to-power ratio
Future/Theoretical: DRAM Refresh
• Two ideas, but not necessarily implemented yet:
  o Intelligent Refresh
    Idea: A cell that has been read or written recently does not need to be refreshed
    Most effective power reduction during periods of heavy use
    Drawback: Large amount of overhead needed to keep track of which cells have been accessed recently
  o OS Controlled Refresh
    Idea: Not necessary to refresh unused memory, so disable it
    The OS knows what memory has been used
    Instead of only swapping out pages when memory is full, swap out unused memory -> No refresh
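The intelligent refresh idea can be sketched with one countdown counter per DRAM row. An access restarts the row's countdown, and a refresh is issued only when a counter expires. This is an illustrative model with hypothetical names and tick-based timing. The per-row counters are exactly the tracking overhead the slide lists as the drawback.

```python
class IntelligentRefresh:
    def __init__(self, num_rows, retention_ticks):
        self.retention = retention_ticks
        self.counters = [retention_ticks] * num_rows  # ticks left before refresh
        self.refreshes = 0

    def access(self, row):
        """A read or write restores the row's charge, so restart its countdown."""
        self.counters[row] = self.retention

    def tick(self):
        """Advance time by one tick and refresh any row whose countdown expired."""
        for row in range(len(self.counters)):
            self.counters[row] -= 1
            if self.counters[row] == 0:
                self.refreshes += 1
                self.counters[row] = self.retention


if __name__ == "__main__":
    dram = IntelligentRefresh(num_rows=4, retention_ticks=4)
    for _ in range(8):
        dram.access(0)   # row 0 is in constant use
        dram.tick()
    print("refreshes issued:", dram.refreshes)   # rows 1-3 only
```

A conventional controller would refresh all four rows twice over the same eight ticks. Here the busy row is never refreshed explicitly.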
Conclusion
• Basic idea - Reduce power
  o Trade-off -> lower performance and/or more complexity
• Recent architecture and design trends
  o Static power becoming as important as dynamic
    Dynamic: $P = \alpha f C V^2$
    Static: $P = I_{leak} V$ (leakage current times supply voltage)
  o Reduce any of these, reduce overall power