8/30/16
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
New Rules: Sustaining Performance Scaling in a Physical World
Sudhakar Yalamanchili
& Many Students and Collaborators
Computer Architecture and Systems Laboratory Center for Experimental Research in Computer Systems
School of Electrical and Computer Engineering Georgia Institute of Technology
Sponsors:
It's a Physical World
Next Generation Applications
Multi-scale Physical Phenomena
Architecture & Package
[Figure: package cross-section. Labels: PCB, Logic, Memory (x4), TSV, test pads, glass interposer, TPV, RDL, thermal adhesive, thermal vias, encapsulate, GPU]
10 orders of magnitude scale: 10^-9 to 10^1
Overview
1. Scaling Power and Thermal Management
   • Coordinated chip scale management
2. Towards Data Centric Computing
   • Back to the future
3. Pushing Back the Thermal Wall
   • Challenges
1. Scaling Power and Thermal Management
Power and Performance
$$\text{Perf}\left(\frac{\text{ops}}{\text{s}}\right) = \text{Power}\,(\text{W}) \times \text{Efficiency}\left(\frac{\text{ops}}{\text{joule}}\right)$$
Power Supply (regulation) + Power Consumption + Cooling
W. J. Dally, Keynote IITC 2012
www.commons.wikimedia.org enercon.com imaging1.com
You can hide latency but not energy!
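The performance relation on this slide can be checked with a quick calculation. The sketch below is illustrative only: the function name and the numbers (a 100 W budget, 10 and 20 Gops/joule) are assumptions, not figures from the talk.

```python
def performance_ops_per_s(power_w: float, efficiency_ops_per_joule: float) -> float:
    """Perf (ops/s) = Power (W) x Efficiency (ops/joule)."""
    return power_w * efficiency_ops_per_joule


# Hypothetical values: with power capped at 100 W, performance can only
# grow if efficiency (ops/joule) grows.
capped_power_w = 100.0
baseline = performance_ops_per_s(capped_power_w, 10e9)
improved = performance_ops_per_s(capped_power_w, 20e9)
print(f"baseline: {baseline:.3e} ops/s, improved: {improved:.3e} ops/s")
```

Under a fixed power budget, doubling efficiency doubles performance, which is why the supply, consumption, and cooling terms all matter.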
Impact of Power and Thermal Limits for Multicore
• Based on scaling using Pentium-class cores modeled with IntSim¹
¹ D. Sekar, A. Naeemi, R. Sarvari, J. Davis, and J. Meindl, “IntSim: A CAD tool for optimization of multilevel interconnect networks,” Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, 2007
Dark Silicon Gap
Power and Thermal Capacities are Resources
mikedavisfaia.wordpress.com
Managing Thermal Capacity
Graphics Processing Unit (GPU): 384 Radeon Cores
Multithreaded CPU cores
• Many resources are shared between the CPU and the GPU, for example the memory hierarchy and the on-chip network
Accelerated Processing Unit (APU)
Shared Northbridge → access to overlapping CPU-GPU physical address spaces
Programming Model
• Coupled programming model → offload compute-intensive tasks to the GPU
• Consequences for energy or power management?
APU Hardware
CPU
Operating System
User Application
OpenCL™ or other Software Stack
Host Tasks
GPU Tasks
GPU
Each OpenCL kernel
Grid of threads, each operating over a data partition
Managing Thermal Capacity: Thermal Coupling
• Significant rise in temperature of the idle component due to thermal coupling and pollution
• CPU cores consume thermal headroom more rapidly (4X faster)
• GPUs sustain power boosts longer!
• Better management can yield 10%-40% gains in measured energy efficiency
• Power management ≠ thermal management
[Figure: temperature on Core 2 when Core 3 is busy and the remaining cores are idle; cores are labeled 0 to 3]
I. Paul, S. Manne, M. Arora, W.L. Bircher, S. Yalamanchili, “Cooperative boosting: needy versus greedy power management”, ISCA 2013.
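Thermal coupling of an idle component to a busy neighbor can be illustrated with a two-node lumped RC model. This is a minimal sketch under stated assumptions, not the model used in the cited work: the function name and every parameter value (thermal resistances, capacitance, ambient temperature, power) are hypothetical.

```python
def simulate_coupled_cores(p_busy_w, p_idle_w, seconds, dt=0.01,
                           t_amb=45.0, r_amb=2.0, r_couple=1.0, c_th=5.0):
    """Forward-Euler integration of two thermally coupled nodes.

    Each node leaks heat to ambient through r_amb (K/W) and exchanges heat
    with its neighbor through r_couple (K/W); c_th is the heat capacity (J/K).
    All values are illustrative assumptions.
    """
    t_busy = t_idle = t_amb
    for _ in range(round(seconds / dt)):
        flow = (t_busy - t_idle) / r_couple          # heat flowing busy -> idle (W)
        d_busy = (p_busy_w - (t_busy - t_amb) / r_amb - flow) / c_th
        d_idle = (p_idle_w - (t_idle - t_amb) / r_amb + flow) / c_th
        t_busy += d_busy * dt
        t_idle += d_idle * dt
    return t_busy, t_idle


# The idle node dissipates no power, yet it settles well above ambient.
t_busy, t_idle = simulate_coupled_cores(p_busy_w=10.0, p_idle_w=0.0, seconds=200.0)
print(f"busy: {t_busy:.1f} C, idle: {t_idle:.1f} C, ambient: 45.0 C")
```

With these assumed values the idle node ends up several degrees above ambient purely through coupling, which is the headroom a neighboring component loses.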
Thermal Coupling Effects
[Figure: peak die temperature and relative CPU and GPU power vs. time (seconds); series: GPU power, CPU CU0 power, CPU CU1 power, peak die temperature]
Annotations: CPU power is limited, GPU running at max DVFS state; thermal coupling; temperature throttling
I. Paul, S. Manne, M. Arora, W.L. Bircher, S. Yalamanchili, “Cooperative boosting: needy versus greedy power management”, ISCA 2013.
Need for Package Scale Coordinated Thermal Management!
Allocating Thermal Capacity: Cooperative Boosting
[Figure: speedup per benchmark (NDL, HS, BF, FAH, BS, Viewdle, Lbm, Perl, MEAN); series: P0, P2, P4, CB. Data labels as extracted: 1.28, 1.10, 1.13, 1.10, 1.04, 1.36, 1.00, 0.99, 1.10]
CPU DVFS states: HW-only (boost) Pb0, Pb1; SW-visible P0, P1, P2, ..., Pmin
Baseline (commercial part): static power cap
Coordinated, opportunistic consumption of thermal capacity
I. Paul, S. Manne, M. Arora, W.L. Bircher, S. Yalamanchili, “Cooperative boosting: needy versus greedy power management”, ISCA 2013.
Impact of Power and Thermal Limits for Multicore
• Based on scaling using Pentium-class cores modeled with IntSim¹
¹ D. Sekar, A. Naeemi, R. Sarvari, J. Davis, and J. Meindl, “IntSim: A CAD tool for optimization of multilevel interconnect networks,” Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, 2007
Power and Thermal capacities are the new shared resources
Dark Silicon Gap
Managing Power Capacity
[Figure: % increase in run-time vs. CPU DVFS state (P0 to P4) per kernel in miniMD; kernels: Total, Force, Neighbor, Comm, Other]
Dynamic Demand Power Sensitivity → Energy Efficiency¹
Design Space² (BFS) Sensitivity
Need more DVFS states?
¹ I. Paul et al., “Coordinated Energy Management in Heterogeneous Processors,” SC13
² A. McLaughlin et al., “A Power Characterization and Management of GPU Graph Traversal,” ADMS 2014
It's Not Just Compute: Hardware Balance
[Figure: normalized performance vs. normalized available ops/byte in hardware; regions: compute limiting, memory bandwidth limiting]
Relative energy costs of compute and memory access
Relative ops/byte demand of application
Hardware Balance
Coordinate Power States to Achieve Balance
Balance Point: how efficiently is energy used in the core or in the memory system?
I. Paul, W. Huang, M. Arora, and S. Yalamanchili, “Harmonia: Balancing Compute and Memory Power in High Performance GPUs,” IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2015.
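The balance point can be sketched with a simple roofline-style bound. This is an illustrative model, not the Harmonia formulation: the function names and the hardware numbers (1 Tops/s peak, 100 GB/s bandwidth) are assumptions.

```python
def attainable_performance(peak_ops_per_s, bandwidth_bytes_per_s, app_ops_per_byte):
    """Performance is bounded by compute or by memory bandwidth, whichever is lower."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * app_ops_per_byte)


def hardware_balance(peak_ops_per_s, bandwidth_bytes_per_s):
    """Available ops/byte in hardware; equals the application demand at the balance point."""
    return peak_ops_per_s / bandwidth_bytes_per_s


# Hypothetical hardware: 1e12 ops/s peak compute, 1e11 bytes/s memory bandwidth.
peak, bandwidth = 1e12, 1e11
print("hardware balance (ops/byte):", hardware_balance(peak, bandwidth))
for demand in (4.0, 10.0, 25.0):  # application ops/byte demand
    perf = attainable_performance(peak, bandwidth, demand)
    limiter = "memory bandwidth" if bandwidth * demand < peak else "compute"
    print(f"demand {demand:>4} ops/byte -> {perf:.2e} ops/s, limited by {limiter}")
```

When the application demand is below the hardware balance, extra compute power is wasted; above it, extra memory power is wasted. Moving power states shifts the hardware balance toward the demand.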
Shift in the Balance Point
[Figure: normalized performance vs. core frequency (MHz) and memory bandwidth (GB/s)]
[Figure: normalized performance vs. number of active CUs and memory bandwidth (GB/s)]
Balance plane for performance and energy
I. Paul, W. Huang, M. Arora, and S. Yalamanchili, “Harmonia: Balancing Compute and Memory Power in High Performance GPUs,” IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2015.
Relative energy costs of compute and memory access
Relative ops/byte demand of application
Hardware Balance
Up to 36% power savings with a maximum performance loss of 3.6%
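A quick calculation shows what these two figures would mean for efficiency if they occurred together. That is an assumption: the slide gives them as separate bounds ("up to" and "maximum"), so the result is an illustrative best case rather than a reported number.

```python
power_savings = 0.36       # from the slide: up to 36% power savings
performance_loss = 0.036   # from the slide: maximum performance loss of 3.6%

relative_power = 1.0 - power_savings        # power relative to baseline
relative_perf = 1.0 - performance_loss      # performance relative to baseline

# Assumption: both extremes apply to the same run.
perf_per_watt_gain = relative_perf / relative_power
relative_energy = relative_power / relative_perf   # energy for a fixed amount of work

print(f"performance per watt: {perf_per_watt_gain:.2f}x baseline")
print(f"energy for the same work: {relative_energy:.2%} of baseline")
```

Under that assumption, performance per watt improves by about 1.5x and the energy needed for the same work drops to roughly two thirds of the baseline.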
Sensitivity Analysis for Chip Scale Coordinated Power Management
Framework for Managing Power Capacity
Assign Power Budgets
[Figure: block diagram with regulator (reg) blocks]
Temperature Regulation
Power Regulation
K. Rao, W. Song, S. Yalamanchili, and Y. Wardi, “Temperature Regulation in Multicore Processors using Adjustable-Gain Integral Controllers,” IEEE Multi-Conference on Systems and Control, September 2015.
Power Regulation
Power regulation via power state transitions
Unregulated
Regulated
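Regulating power toward a budget can be sketched with a plain integral controller. This is a simplification under stated assumptions: it uses a fixed gain rather than an adjustable gain, a continuous frequency instead of discrete power states, and a hypothetical linear plant (power = watts_per_ghz x frequency). All names and values are illustrative.

```python
def regulate_power(target_w, steps=200, gain=0.02,
                   f_min=0.8, f_max=3.0, watts_per_ghz=20.0):
    """Fixed-gain integral control of power through frequency.

    The plant model (power proportional to frequency) and every parameter
    value are assumptions made for illustration only.
    """
    freq_ghz = f_max                      # start unregulated, at full speed
    history = []
    for _ in range(steps):
        power_w = watts_per_ghz * freq_ghz
        history.append(power_w)
        error_w = target_w - power_w
        # Integral action: accumulate the error into the control input.
        freq_ghz = min(f_max, max(f_min, freq_ghz + gain * error_w))
    return history


trace = regulate_power(target_w=40.0)
print(f"start: {trace[0]:.1f} W, after regulation: {trace[-1]:.3f} W")
```

The measured power starts above the budget and converges to it as the controller integrates the error; a real controller would quantize the output to the available power states.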
2. Emergence of Data Centric Systems
Energy Cost of Data Management
$$\text{Perf}\left(\frac{\text{ops}}{\text{s}}\right) = \text{Power}\,(\text{W}) \times \text{Efficiency}\left(\frac{\text{ops}}{\text{joule}}\right)$$
Three operands x 64 bits/operand
$$\text{DataMovementEnergy} = \#\text{bits} \times \text{distance (mm)} \times \text{energy per bit per mm}$$
* S. Borkar and A. Chien, “The Future of Microprocessors,” CACM, May 2011
Operator_cost + Data_movement_cost + Storage_cost
W. J. Dally, Keynote IITC 2012
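The data movement energy expression can be evaluated for the three-operand case on this slide. The sketch below is illustrative: the function name, the 0.25 pJ per bit per mm figure, and the distances are assumptions, not values from the talk.

```python
def data_movement_energy_pj(num_bits, distance_mm, energy_pj_per_bit_mm):
    """DataMovementEnergy = #bits x distance (mm) x energy per bit per mm."""
    return num_bits * distance_mm * energy_pj_per_bit_mm


bits = 3 * 64                      # three operands x 64 bits/operand (from the slide)
ENERGY_PJ_PER_BIT_MM = 0.25        # hypothetical wire energy, for illustration only

for distance_mm in (1.0, 10.0):    # hypothetical on-chip distances
    energy = data_movement_energy_pj(bits, distance_mm, ENERGY_PJ_PER_BIT_MM)
    print(f"{bits} bits over {distance_mm:>4} mm -> {energy:.1f} pJ")
```

Because the cost is linear in distance, moving the same operands ten times farther costs ten times the energy, independent of the operation performed on them.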
Interconnect Energy Taper: Electrical
• Relative costs of compute and memory accesses
• Time and energy costs have shifted to data movement
Courtesy Greg Astfalk, HP
[Figure: memory hierarchy (cores, L1 caches, last level cache, DRAM) annotated with data access latency (1's ns, 10's ns, 100's ns, ms) and data access energy]
Pin Bandwidth Challenges¹
• Number of transistors/die continues to grow
• Number of pins is growing at a slower rate than the number of transistors
• Supply pins are crowding out data pins
• Reducing supply current/pin limits growth of #transistors/die
¹ P. Stanley-Marbell, V. C. Cabezas, and R. P. Luijten, “Pinned to the Wall – Impact of Packaging and Applications on the Memory and Power Walls,” IEEE/ACM International Symposium on Low-Power Electronics and Design (ISLPED), 2011
Data pin bandwidth is not growing as fast as number of transistors on chip
[Figure: two packaging options. Left: CPU die on package substrate on PCB, connected to a DRAM die. Right: CPU die and DRAM die on a Si interposer, on package substrate, on PCB]
Back to the Future: Processing Near Memory
1990's
Kogge: Execube (4K DRAM + 100K gate parallel processor) (www3.nd.edu)
Draper et al., DIVA chip (isi.edu)
Today
3D Interposer System
Courtesy Gokul Kumar
[Figure: 3D interposer system. Labels: PCB/PWB, BGA, glass interposer, TPVs, RDL, logic die, memory stack (x4), TSV, test pads, thermal adhesive, thermal vias, encapsulate, GPU]
Back to Power and Thermal Management!
Thermal Coupling in Three Dimensions
The (Re)-Emergence of Near Data Processing
[Figure: multicore chip and DRAM stacks (logic tier plus memory tiers) on a silicon interposer; labels: compute package, capacity tier memory]
New BW hierarchy and energy taper
• Hybrid Memory Cube (HMC)
• High Bandwidth Memory (HBM)
• Wide I/O
NeuroCube: Programmable Digital Neural Emulator
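[Figure: Neurocube architecture. Labels: host, link, vault, DRAM die, TSV, VC, NoC routers (R), processing elements (PE), PNG; weights w_i,j and inputs x_i of the current layer are mapped into memory]
Layerwise operation: for all layers, once one layer is done, program and initiate the new layer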
D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A Programmable Digital Neuromorphic Architecture with High Density 3D Memory,” IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2016.
High Bandwidth exploited for both training and inference
Collaboration with S. Mukhopadhyay
3. Pushing Back the Thermal Wall
Performance
Reliability Energy/Power
Increasing Thermal Capacity
$$\text{Perf}\left(\frac{\text{ops}}{\text{s}}\right) = \text{Power}\,(\text{W}) \times \text{Efficiency}\left(\frac{\text{ops}}{\text{joule}}\right)$$
W. J. Dally, Keynote IITC 2012
Conventional, Phase Change Material, Microfluidics
Image credits: megahdwall.com, wisefull.com
Increase Performance/ft^3; Impact on Revenue
Microfluidic Cooling
• Fluid flow through the microchannels carries heat out to an external heat exchanger (e.g., heat sink)
Courtesy L. Zheng (ECE) and Professor Muhannad Bakir (ECE)
Fabrication
Courtesy L. Zheng (ECE) and Professor Muhannad Bakir (ECE)
Feature            # / cm2   Specifications
Micropin-fins      1,936     Diameter = 150 µm, Pitch = 225 µm
TSVs (4x4 array)   30,976    Diameter = 13 µm, Pitch = 24 µm
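The densities in the table follow from the pin-fin pitch if one assumes a square grid and one 4x4 TSV array per micropin-fin. That reading is an assumption, though it reproduces both table values; the function name is hypothetical.

```python
def features_per_cm2(pitch_um: float) -> int:
    """Whole features on a square grid within 1 cm x 1 cm (10,000 µm per side)."""
    per_side = int(10_000 // pitch_um)   # whole pitches that fit along one side
    return per_side * per_side


pin_fins = features_per_cm2(225)   # pitch from the table: 225 µm
tsvs = pin_fins * 4 * 4            # assumed: one 4x4 TSV array inside each pin-fin

print("micropin-fins per cm2:", pin_fins)
print("TSVs per cm2:", tsvs)
```

Forty-four pin-fins fit along each centimeter at a 225 µm pitch, so the grid holds 44 x 44 pin-fins, and sixteen TSVs per pin-fin gives the TSV density.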
ICECOOL
Workload Cooling Co-Design
[Figure: floorplan of 16 symmetric cores, each with FE, SCH, DL1, INT, and FPU blocks]
Cores: 16 symmetric, Nehalem-like, OoO; 3 GHz, 1.0 V, max temp 100 °C
DL1: 128KB, 4096 sets, 64B
IL1: 32KB, 256 sets, 32B, 4 cycles
L2 & Network Cache Layer: L2 (per core): 2MB, 4096 sets, 128B, 35 cycles
DRAM: 1GB, 50 ns access time (for performance model)
Ambient temperature: 300 K
• Thermal grids: 50x50
• Sampling period: 1 µs
• Steady-state analysis
Floorplan: each core 2.1 mm x 2.1 mm; core layer 8.4 mm x 8.4 mm; L2 layer with 16 L2 banks, 8.4 mm x 8.4 mm
H. Xiao, Z. Min, S. Yalamanchili and Y. Joshi, “Leakage Power Characterization and Minimization over 3D Stacked Multi-core Chip with Microfluidic Cooling,” IEEE Symposium on Thermal Measurement, Modeling, and Management (SEMITHERM), March 2014
Architecture-Cooling Co-Design: Performance
[Figure: normalized EPI comparison among 4 pin fin structures]
Zhimin Wan, He Xiao, Yogendra Joshi, Sudhakar Yalamanchili, “Co-Design of Multicore Architectures and Microfluidic Cooling for 3D Stacked ICs,” Microelectronics Journal, May 2014.
Thermally Adaptive Architecture: Delay Scaling
LLC access reduction of 42% from 300 K to 400 K
• Increase the dynamic operating range of the package
• Increase performance per volume
• Coupled with power management
[Figure: 3D stack with back side air cooling. Labels: Cu heat spreader, memory, memory, cache, processor, BT substrate, PCB]
H. Xiao, W. Yueh, S. Mukhopadhyay and S. Yalamanchili, “Thermally Adaptive Cache Access Mechanisms for 3D Many-Core Architectures,” IEEE Computer Architecture Letters, October 2015.
Impact of Technology Scaling & Thermal Capacity
[Figure: performance increase from baseline vs. technology node (11, 14, 18, 20, 24, 28 nm); series: Pkts/sec with air cooling, Pkts/sec with liquid cooling]
Frequency must be reduced to keep max. temperature at 100°C, reducing performance
Estimated Scaling Factors
                          28nm   24nm   20nm   18nm   14nm   11nm
Area Scaling              1.00   0.74   0.52   0.42   0.25   0.16
Dynamic Power Scaling     1.00   0.78   0.57   0.53   0.30   0.19
Leakage Power Scaling     1.00   0.96   0.90   0.83   0.80   0.75
Power Density             1.00   1.12   1.28   1.46   1.76   2.17
Altera Networking App: Estimated Avg. Power Density (W/cm2)
                          28nm   24nm   20nm   18nm   14nm   11nm
                          22.82  26.56  29.87  34.59  39.09  50.73
Parametric scaling models tuned based on foundry data
Altera Networking Design: Single 3-die 3D FPGA
[Figure: 3D FPGA stack with flow inlet and flow outlet]
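The link between cooling and sustainable frequency can be sketched with a lumped thermal resistance. This is a minimal illustration, not the parametric model behind the slide: the function name, the cubic power-versus-frequency law, and every parameter value are assumptions, and temperature-dependent leakage is ignored. Only the 100°C limit comes from the slide.

```python
def max_frequency_ghz(r_th_k_per_w, t_max_c=100.0, t_amb_c=27.0,
                      p_static_w=10.0, k_w_per_ghz3=5.0):
    """Highest frequency that keeps T = T_amb + R_th * P at or below t_max_c.

    Assumes P = p_static_w + k_w_per_ghz3 * f**3 (hypothetical power model).
    """
    power_budget_w = (t_max_c - t_amb_c) / r_th_k_per_w
    dynamic_budget_w = max(0.0, power_budget_w - p_static_w)
    return (dynamic_budget_w / k_w_per_ghz3) ** (1.0 / 3.0)


# Hypothetical thermal resistances: liquid cooling halves junction-to-ambient R_th.
air_ghz = max_frequency_ghz(r_th_k_per_w=1.0)
liquid_ghz = max_frequency_ghz(r_th_k_per_w=0.5)
print(f"air: {air_ghz:.2f} GHz, liquid: {liquid_ghz:.2f} GHz")
```

A lower thermal resistance enlarges the power budget at the same temperature limit, so less frequency has to be given up as power density rises with scaling.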
Visibility: The Energy Stack
Circuit level: DVFS, power states, clock gating
Chip and Package: power multiplexing, spatiotemporal migration
Board: applications, virtual power, operating system, middleware
Rack: mechanical design, airflow, HVAC
[Figure: stack of Memory (SRAM), Processor (Logic), Memory (DRAM)]
• Multi-scale: Power, Performance and Reliability (PPR) events occur at distinct time scales for distinct durations
• Impact of workload models at all levels
• Transient vs. steady state
• Importance of composition of models for use in systems as well as simulators
Needed: Visibility at the Higher Levels!
What Next?
Novel Cooling Technology
Thermal Field Modeling
Power Distr. Network
Power Management
μarchitecture
Algorithms
Microarchitecture and Workload Execution
Power Dissipation
Thermal Coupling and Cooling
Degradation and Recovery
Training Data
Model Parameters
Regression, Model Estimation
On-line, Coordinated Global Power and Thermal Management
Building Models
Power-Architecture-Thermal Co-Design
Architecture-Physical Modeling
Energy Efficient Design
Acknowledgements: Collaborators
• Students: Nawaf Alamoosa, Xinwei Chen, Minki Cho, Minhaj Hassan, Chad Kersey, Duckhwan Kim, Jaeha King, Karthik Rao, William Song, Zhimin Wan, He Xiao
• Colleagues: Muhannad Bakir, Sek Chai, Yogendra Joshi, SriLatha Manne, Saibal Mukhopadhyay, Indrani Paul, Yorai Wardi
[Figure: Scaling Performance. Labels: Applications & Architecture, Power, NoC, Technology & Cooling]
Thank You
Questions?