Extreme Scale Computer Architecture: Energy Efficiency from the Ground Up
Josep Torrellas
Department of Computer Science
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
WNTC Workshop, December 2012
Josep Torrellas, Extreme Scale Computing
Wanted: Energy-Efficient Computing
• Extreme Scale computing: 100-1000x more capable for the same power consumption and physical footprint
  • Exascale ($10^{18}$ ops/s) datacenter: 20 MW
  • Petascale ($10^{15}$ ops/s) departmental server: 20 kW
  • Terascale ($10^{12}$ ops/s) portable device: 20 W
Energy-Efficiency Gap
• Goal:
  • 20 W Tera-Op (sustained)
  • 20 pJ/operation
• In comparison:
  • IBM Power7 (released 2010): MCM draws 800 W for 1 TFlop peak
• Problem is harder than it looks:
  • Machines spend much of the energy transferring data
  • Minimizing the energy of data transfer, not of the ALU operation, is the challenge
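The 20 pJ/operation goal follows directly from dividing the power budget by the throughput. The short sketch below redoes that arithmetic for the three tiers and for the Power7 data point quoted above; the function and variable names are illustrative only, and the comparison is rough because the Power7 figure is a peak rate while the goal is sustained.

```python
# Energy per operation in picojoules: power (W) / throughput (ops/s), scaled by 1e12.
def picojoules_per_op(power_watts: float, ops_per_second: float) -> float:
    return power_watts / ops_per_second * 1e12


# (power in W, throughput in ops/s) for each tier, as quoted on the slides.
targets = {
    "exascale datacenter": (20e6, 1e18),
    "petascale server": (20e3, 1e15),
    "terascale portable": (20.0, 1e12),
}
goal_pj = {name: picojoules_per_op(p, r) for name, (p, r) in targets.items()}

power7_pj = picojoules_per_op(800.0, 1e12)  # 800 W MCM at 1 TFlop peak
gap = power7_pj / goal_pj["terascale portable"]

for name, pj in goal_pj.items():
    print(f"{name}: {pj:.0f} pJ/op")
print(f"Power7 MCM at peak: {power7_pj:.0f} pJ/flop, about {gap:.0f}x the goal")
```

All three tiers work out to the same energy budget per operation, which is why the talk treats them as one problem.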
Recap: How Did We Get Here?
• Ideal Scaling (or Dennard Scaling): every semiconductor generation:
  – Dimension: 0.7
  – Area of transistor: 0.7 x 0.7 = 0.49
  – Supply voltage (Vdd), C: 0.7
  – Frequency: 1/0.7 = 1.4
• Before scaling: x transistors in area $A$; power density $= C V_{dd}^2 f / A$
• After scaling: x transistors in area $0.7^2 A$; power density $= (0.7\,C)(0.7^2\,V_{dd}^2)(1.4\,f) / (0.7^2 A) \approx C V_{dd}^2 f / A$
• Constant power density
Recap: How Did We Get Here? (II)
• Real Scaling: Vdd does not decrease much.
  – If Vdd is too close to the threshold voltage (Vth), the transistor is slow
  – Delay of transistor is inversely proportional to (Vdd - Vth)
  – Dynamic power density increases with smaller technology
• Additionally: there is the static power
• Power density increases rapidly
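To see how much the stalled Vdd matters, the sketch below evaluates dynamic power density, C·Vdd²·f / A, after one generation relative to before. It assumes that capacitance, frequency, and area still follow the ideal 0.7 factors and that only the Vdd scaling changes; real processes deviate from this, so the numbers are indicative only. The function name is made up for the example.

```python
def power_density_ratio(s: float, vdd_scale: float) -> float:
    """Dynamic power density after one generation, relative to before.

    Assumes C scales by s, f by 1/s, and area by s**2 (ideal scaling),
    while Vdd scales by vdd_scale.
    """
    return (s * vdd_scale ** 2 * (1.0 / s)) / s ** 2


ideal = power_density_ratio(0.7, 0.7)       # Vdd shrinks with the dimension
stuck_vdd = power_density_ratio(0.7, 1.0)   # Vdd does not shrink at all

print(f"ideal scaling: {ideal:.2f}x power density per generation")
print(f"constant Vdd:  {stuck_vdd:.2f}x power density per generation")
```

Under these assumptions, a Vdd that stays flat roughly doubles the dynamic power density each generation, before static power is even counted.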
Design for Energy Efficiency from the Ground Up
• New designs for chips with 1K cores:
  – Efficient support for high concurrency
  – Data transfer minimization
• New technologies:
  – Low supply voltage (Vdd) operation
  – Efficient on-chip voltage regulation
  – 3D die stacking
  – Resistive memory
  – Photonic interconnects
Thrifty Multiprocessor
• Funded by DOE, DARPA, NSF
• Runnemede project led by Intel and funded by DARPA UHPC [HPCA2013]
[Figure: Thrifty system hierarchy: 1,000-core chip with stacked DRAM, CPU module, board, cabinet. On-chip interconnect labels: 64B crossbar network, 16B crossbar, barrier network.]
Low Voltage Operation
• Vdd reduction is the best lever for energy efficiency:
  • Big reduction in dynamic power; also reduction in static power
• Reduce Vdd to a bit higher than Vth (Near-Threshold Voltage, NTV)
  • Corresponds to a Vdd of about 0.55 V rather than the current 0.9 V
• Advantages:
  • Potentially reduces power consumption by more than 40x
• Drawbacks:
  • Lower speed (1/10)
  • Increase in gate delay variation
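A first-order check of these numbers can be made with the dynamic power relation C·Vdd²·f used earlier in the talk. The sketch below assumes the switched capacitance is unchanged and ignores static power entirely, so it is not meant to reproduce the "more than 40x" figure, which covers total power; the names are illustrative.

```python
def dynamic_power_ratio(vdd_new: float, vdd_old: float, freq_ratio: float) -> float:
    """Dynamic power at the new operating point relative to the old one.

    Uses P_dyn ~ C * Vdd^2 * f with C unchanged (an assumption of this sketch).
    """
    return (vdd_new / vdd_old) ** 2 * freq_ratio


FREQ_RATIO = 0.1  # "Lower speed (1/10)" from the slide

power_ratio = dynamic_power_ratio(0.55, 0.9, FREQ_RATIO)
power_reduction = 1.0 / power_ratio
# Energy per operation is power divided by rate, so the frequency term cancels.
energy_per_op_ratio = power_ratio / FREQ_RATIO

print(f"dynamic power: about {power_reduction:.0f}x lower")
print(f"dynamic energy per operation: about {1.0 / energy_per_op_ratio:.1f}x lower")
```

The model also shows why more concurrency is needed at NTV: each core is about ten times slower, so matching the original throughput takes roughly ten times as many active cores.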
Basics of Parameter Variation
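• Deviation of device parameters from nominal values: e.g., Vth, Leff
• Additionally: the same ∆Vth causes higher ∆f and ∆P at NTV
[Figure: (left) distribution of Vth from low Vth to high Vth around VthNOM, annotated "Chip PSTA ↑"; (right) number of paths versus delay τ, with τNOM and τVAR marked, annotated "Chip f ↓".]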
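The claim that the same ∆Vth hurts more at NTV can be illustrated with the delay model quoted earlier in the talk, where transistor delay is inversely proportional to (Vdd - Vth), so frequency is proportional to (Vdd - Vth). The nominal Vth of 0.35 V and the 30 mV shift below are assumed values chosen for illustration; they do not come from the slides.

```python
def relative_freq_drop(vdd: float, vth_nom: float, delta_vth: float) -> float:
    """Fractional frequency loss when Vth rises by delta_vth.

    Uses f ~ (Vdd - Vth), i.e. delay ~ 1 / (Vdd - Vth).
    """
    f_nom = vdd - vth_nom
    f_var = vdd - (vth_nom + delta_vth)
    return (f_nom - f_var) / f_nom


VTH_NOM = 0.35    # assumed nominal threshold voltage (V)
DELTA_VTH = 0.03  # assumed variation-induced shift (V)

conventional = relative_freq_drop(0.90, VTH_NOM, DELTA_VTH)
ntv = relative_freq_drop(0.55, VTH_NOM, DELTA_VTH)

print(f"Vdd = 0.90 V: {conventional:.1%} slower")
print(f"Vdd = 0.55 V: {ntv:.1%} slower")
```

Because the shift is divided by a much smaller (Vdd - Vth) at NTV, the same ∆Vth produces a frequency loss several times larger than at conventional Vdd.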
Variation in Thrifty Manycore
• Larger f variation at NTV
• Memories more vulnerable
• Power varies more
[Figure: bar chart of the max/min ratio of frequency (axis 0 to 5), NTV versus conventional, for intra-core, intra-local-memory, and inter-memory variation. Chip diagram labels: cluster, core + local memory, cluster memory.]
Using VARIUS-NTV by Karpuzcu et al.
Multiple Vdd Domains at NTV: Hardly Effective
• On-chip regulators have a high power loss (10+%)
• To reduce costs, only coarse-grain (multiple-core) domains
  • Already has variation inside the domain
• Small Vdd domain more susceptible to load variations
  • Larger Vdd droops require a larger Vdd guardband
Work with: Ulya Karpuzcu (U Minn) and Nam Sung Kim (U Wisc)
Propose: Energy Efficiency with a Single Vdd Domain
• Each cluster in the chip is an f domain
• Allocation in units of multiples of clusters called Ensembles
  • Whole ensemble clocked at a single f
  • Simpler variation-aware core allocation
[Figure: chip diagram. Labels: cluster memory, core + local memory.]
• One Vdd domain, many f domains
  • Simple hardware, simple and effective core allocation
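One way to picture variation-aware allocation is the sketch below. It is not the allocation algorithm from the talk. It assumes that an ensemble's single clock is limited by its slowest cluster and that any clusters may be grouped regardless of position; the per-cluster maximum frequencies are made-up values, and `pick_ensemble` is a hypothetical helper.

```python
from typing import List, Tuple


def pick_ensemble(cluster_fmax_ghz: List[float], k: int) -> Tuple[List[int], float]:
    """Choose k clusters so that the shared ensemble clock is as high as possible.

    Assumption: the ensemble runs at the maximum frequency of its slowest
    member, so taking the k fastest clusters maximizes the shared clock.
    """
    if not 0 < k <= len(cluster_fmax_ghz):
        raise ValueError("k must be between 1 and the number of clusters")
    ranked = sorted(range(len(cluster_fmax_ghz)),
                    key=lambda i: cluster_fmax_ghz[i], reverse=True)
    chosen = sorted(ranked[:k])
    ensemble_f = min(cluster_fmax_ghz[i] for i in chosen)
    return chosen, ensemble_f


# Hypothetical per-cluster maximum frequencies (GHz) under process variation.
fmax = [0.42, 0.55, 0.38, 0.61, 0.50, 0.47]
clusters, f = pick_ensemble(fmax, k=3)
print(f"ensemble clusters {clusters} clocked at {f:.2f} GHz")
```

With a single Vdd, frequency is the only per-cluster knob, which is what keeps this kind of allocation decision simple.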
Effectiveness of Single Vdd Domain per Chip
• Single Vdd is more energy efficient
[Figure: bar chart of normalized MIPS/Watt (axis 0.4 to 1.0) for a 288-core chip with 8-core clusters. Bars: Single Vdd; Perfect; + Regulator power loss; + Coarse-grain Vdd domains; + Larger Vdd margin. Annotations: 15%, 15%, 5%, 10%, 20%, 25%, "Realistic".]
Needed: Efficient On-Chip Vdd Regulation
• Voltage regulators (VRs) have to be designed for high efficiency
  – Hierarchical design:
    • First level placed on a different die; second level regulates a small range only
• Energy-efficient design requires short Vdd guardbands
  – Need to tackle voltage droops due to load variation
From Nam Sung Kim, UWisc
Streamlined 1K-core Architecture
• Very simple cores (no structures for speculative execution)
• Cores organized in clusters with memory to exploit locality
• Each cluster is heterogeneous (has one large core)
• Special instructions for certain ops: fine-grain synch
• Single address space without hardware cache coherence
Managing the Power of On-Chip Memories
• On-chip memory leakage: major contributor to the NTV chip power
• Coarse-grained proposals are insufficient
  • Turn off some memory modules / disable cache ways / …
• Needed: power on only the lines that contain useful data
• Proposal:
  • Use on-chip memory technology that does not leak (eDRAM), but needs to be refreshed
  • Use fine-grain, intelligent refresh of the on-chip memory
• Great opportunity for major power savings
  • Much of the on-chip memory contains useless data!
When Useless Refresh Happens
• Cold lines: Lines not used or used far apart in time
• Hot lines: Lines actively used
Polyphase: Intelligent Refresh
• When to refresh:
  • Divide the retention period into equal intervals called Phases
  • Maintain for each line: the phase in which it was last accessed (or refreshed)
  • A line is refreshed only when the same phase arrives in the next retention period.
Polyphase: Intelligent Refresh
• What to refresh:
  • Use state of the line:
    • Valid data but timeout: WB (n,m)
      • Dirty lines refreshed n times before writeback
      • Clean lines refreshed m times before invalidation
Simple Hardware
• When to refresh:
  • Cache controller keeps, for each line, the phase in which it was last refreshed/accessed
  • At the beginning of a phase: controller checks for lines with matching phase
  • For each line: 2 bits for phase, 1 for valid
• What to refresh:
  • Keep a per-line countdown of refreshes
    • Reset at access
    • Decrement at refresh
  • When counter reaches zero, write back/invalidate
• 40-60% reduction in on-chip memory energy with no slowdown
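The phase matching and the WB(n,m) countdown can be combined in a small behavioral model. The sketch below is a simplified illustration, not the hardware design from the talk: it uses four phases (matching the 2 phase bits per line), and it assumes that a dirty line whose counter has run out is written back and then dropped, which the slides do not spell out. Class and method names are made up for the example.

```python
from dataclasses import dataclass

NUM_PHASES = 4  # 2 phase bits per line, as on the slide


@dataclass
class Line:
    valid: bool = False
    dirty: bool = False
    phase: int = 0  # phase of the last access or refresh
    count: int = 0  # refreshes left before writeback/invalidation


class PolyphaseCache:
    def __init__(self, num_lines: int, n: int, m: int):
        self.lines = [Line() for _ in range(num_lines)]
        self.n, self.m = n, m  # refresh budgets for dirty and clean lines
        self.phase = 0
        self.refreshes = self.writebacks = self.invalidations = 0

    def access(self, idx: int, write: bool = False) -> None:
        """An access recharges the line: tag the current phase, reset the counter."""
        line = self.lines[idx]
        line.dirty = write or (line.valid and line.dirty)
        line.valid = True
        line.phase = self.phase
        line.count = self.n if line.dirty else self.m

    def start_next_phase(self) -> None:
        """At a phase boundary, handle only the lines tagged with the new phase."""
        self.phase = (self.phase + 1) % NUM_PHASES
        for line in self.lines:
            if not line.valid or line.phase != self.phase:
                continue
            if line.count > 0:
                line.count -= 1          # refresh and decrement the countdown
                self.refreshes += 1
            elif line.dirty:
                self.writebacks += 1     # assumption: write back, then drop the line
                line.valid = line.dirty = False
            else:
                self.invalidations += 1  # clean line: simply invalidate
                line.valid = False


cache = PolyphaseCache(num_lines=4, n=2, m=1)
cache.access(0, write=True)  # dirty line, budget n = 2
cache.access(1)              # clean line, budget m = 1
for _ in range(3 * NUM_PHASES):  # three retention periods with no further accesses
    cache.start_next_phase()
print(cache.refreshes, cache.writebacks, cache.invalidations)
```

In this model a line that is touched in every phase never matches at a phase boundary, so hot lines are never refreshed explicitly, while cold lines stop consuming refresh energy once their budget runs out.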
Minimizing Data Movement
• Thrifty has several techniques to minimize data movement:
  • Many-core chip organization based on clusters
  • Mechanisms to manage the cache hierarchy in software
  • Simple compute engines in the memory controllers: Processing in Memory (PIM)
  • Efficient synchronization mechanisms
Software Managed Caches (SMC)
• When a core references data, HW brings a copy of the line to its cache from the first level of the hierarchy in which it finds the line
  • May not be the latest version
• Writes do not invalidate/update other copies of the line
• Need instructions to perform explicit write-back and invalidate
[Figure: processors P1 and P2, each with a local memory, sharing a cluster memory. Steps: 1: Writeback addr (line); 2: Invalidate addr (line); 3: Read addr (line).]
SMC Programming
• Programmer/compiler inserts data-movement instructions at synchronization points
• Hopefully minimizes data transferred compared with hardware coherence
[Figure: code for two threads, with barriers separating the epochs (labels: past epoch, current epoch, next epoch).]
Thread 1: ST A[i]; WB A[i]; barrier; INV A[1]; LD A[1]; ST B[1]; WB B[1]; barrier
Thread 2: ST A[i]; WB A[i]; barrier; INV A[2]; LD A[2]; ST B[2]; WB B[2]; barrier
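The need for explicit write-back and invalidate can be seen in a toy model of two cores with private local memories over a shared cluster memory. The sketch below is a sequential simulation under simplifying assumptions: word-granularity lines, no barriers (program order stands in for them), and class names invented for the example.

```python
class ClusterMemory:
    """Shared level of the hierarchy."""

    def __init__(self):
        self.data = {}


class Core:
    """A core with a private local memory and no hardware coherence."""

    def __init__(self, shared: ClusterMemory):
        self.shared = shared
        self.local = {}

    def load(self, addr):
        # A miss fetches from the shared level; a hit may return a stale copy.
        if addr not in self.local:
            self.local[addr] = self.shared.data.get(addr, 0)
        return self.local[addr]

    def store(self, addr, value):
        # Writes stay local: other copies are neither invalidated nor updated.
        self.local[addr] = value

    def writeback(self, addr):
        if addr in self.local:
            self.shared.data[addr] = self.local[addr]

    def invalidate(self, addr):
        self.local.pop(addr, None)


mem = ClusterMemory()
p1, p2 = Core(mem), Core(mem)

p2.load("A")          # P2 caches the old value
p1.store("A", 42)     # P1 produces a new value locally
stale = p2.load("A")  # P2 still sees its old copy

p1.writeback("A")     # 1: producer writes the line back
p2.invalidate("A")    # 2: consumer discards its stale copy
fresh = p2.load("A")  # 3: consumer re-reads from cluster memory

print(f"before WB/INV: {stale}, after WB/INV: {fresh}")
```

Both steps are needed: without the write-back the new value never leaves P1, and without the invalidate P2 keeps hitting on its stale copy.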
Processing in Memory
Micron’s Hybrid Memory Cube (HMC) [Micron10]:
• Memory chip with 4 or 8 DRAM dies over 1 logic die
• Can be placed in an MCM with processor dies
• DRAM dies only store data while logic die handles DRAM control
Future use of logic die:
• Support for Intelligent Memory Operations?
  • Preprocessing data as it is read from memory
  • Performing processor commands “in place”
Supporting Fine-Grain Parallelism
• Synchronization and communication primitives
  • Efficient point-to-point synch between two cores (F/E bits)
  • Dynamic hierarchical hardware barriers
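As a software analogy for point-to-point synchronization with F/E bits, the sketch below models a single memory word guarded by a full/empty flag, assuming F/E stands for full/empty as in the classic scheme. The hardware mechanism in the talk is a per-location bit, not a lock and condition variable; the class and method names here are hypothetical.

```python
import threading


class FullEmptyCell:
    """One word with a full/empty flag: writers wait for empty, readers for full."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write_when_empty(self, value) -> None:
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True   # mark full and wake the consumer
            self._cond.notify_all()

    def read_when_full(self):
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False  # mark empty and wake the producer
            self._cond.notify_all()
            return self._value


def demo(n: int = 5):
    cell = FullEmptyCell()
    received = []

    def producer():
        for i in range(n):
            cell.write_when_empty(i)

    def consumer():
        for _ in range(n):
            received.append(cell.read_when_full())

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


print(demo())
```

Each value is handed over exactly once and in order, with no separate lock or barrier visible to the two communicating threads.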
What We Learned
• Naively translating programs written for coherent caches into SMC results in inefficient code
  – Need: good development tools
• Using fine-grain synchronization with F/E bits is hard
  – It typically requires a complete rewrite of the code
Programmability
• Programming highly-concurrent machines has required heroic efforts
• Extreme-scale architectures, with emphasis on power-efficiency, may make it worse
  – Low Vdd requires more concurrency to attain the same performance
  – Need to carefully manage locality and minimize communication
How to Program for High Parallelism?
• Expert programmers:
  • Hooks to manage power and Vdd/frequency
  • Ability to map and control tasks
• Novice programmers:
  • High-level programming models that express locality
    • Hierarchical Tiled Arrays (HTA): computes in recursive blocks
    • Concurrent Collections (CnC): computes in a dataflow manner
  • Autotuning?
• … open problem
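To give a flavor of what "computes in recursive blocks" means, the sketch below multiplies two square matrices by recursively splitting the work into quadrant-sized tiles. It is plain Python written for illustration and is not the HTA library or its API; it assumes square matrices whose size is a power of two, and the function name is made up.

```python
from typing import List

Matrix = List[List[int]]


def blocked_matmul(a: Matrix, b: Matrix, tile: int = 2) -> Matrix:
    """Multiply two n x n matrices by recursive tiling (n must be a power of two)."""
    n = len(a)
    if n & (n - 1):
        raise ValueError("matrix size must be a power of two")
    c = [[0] * n for _ in range(n)]

    def recurse(i0: int, j0: int, k0: int, size: int) -> None:
        if size <= tile:
            # Base case: one small tile, which would fit in a local memory.
            for i in range(i0, i0 + size):
                for k in range(k0, k0 + size):
                    aik = a[i][k]
                    for j in range(j0, j0 + size):
                        c[i][j] += aik * b[k][j]
            return
        half = size // 2
        # Eight sub-products, each confined to quadrant-sized blocks of a, b, c.
        for di in (0, half):
            for dj in (0, half):
                for dk in (0, half):
                    recurse(i0 + di, j0 + dj, k0 + dk, half)

    recurse(0, 0, 0, n)
    return c


a = [[i * 4 + j for j in range(4)] for i in range(4)]
identity = [[int(i == j) for j in range(4)] for i in range(4)]
print(blocked_matmul(a, identity) == a)
```

Expressing the computation this way exposes the tile hierarchy to the runtime, which can then map tiles onto clusters and their memories.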
Conclusion
• Presented the challenges of Extreme Scale Computing:
  • Designing computers for energy efficiency from the ground up
• Described some of the architecture and design ideas
• Programmability may suffer: need to focus on the software
• There is a tradeoff between energy efficiency and resilience