1
High-Performance, Power-Aware Computing
Vincent W. Freeh, Computer Science, NCSU, [email protected]
2
Acknowledgements
Students Mark E. Femal – NCSU Nandini Kappiah – NCSU Feng Pan – NCSU Robert Springer – Georgia
Faculty Vincent W. Freeh – NCSU David K. Lowenthal – Georgia
Sponsor IBM UPP Award
3
The case for power management in HPC
Power/energy consumption a critical issue Energy = Heat; Heat dissipation is costly Limited power supply Non-trivial amount of money
Consequence Performance limited by available power Fewer nodes can operate concurrently
Opportunity: bottlenecks Bottleneck component limits performance of other components Reduce power of some components, not overall performance
Today, the CPU is: a major power consumer (~100 W), rarely the bottleneck, and scalable in power/performance (frequency & voltage)
Power/performance “gears”
4
Is CPU scaling a win?
Two reasons:
1. Frequency and voltage scaling – performance reduction is less than power reduction
2. Application throughput – throughput reduction is less than performance reduction
Assumptions: CPU is a large power consumer; CPU drives the other components; diminishing throughput gains
[Figure: power (curve 1) and application throughput (curve 2) vs. performance (frequency)]
CPU power: P = ½CV²f
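The scaling argument can be sketched numerically. This is an illustrative calculation, not data from the talk: the capacitance is assumed, and the two operating points (2000 MHz @ 1.5 V, 1200 MHz @ 1.1 V) are taken from the gear table later in the deck.

```python
# Illustrative sketch: with P = 1/2 * C * V^2 * f, scaling frequency
# and voltage together cuts power faster than it cuts performance.
# C is an assumed effective switched capacitance, not a measured value.

def dynamic_power(c, v, f):
    """CMOS dynamic power: P = 1/2 * C * V^2 * f."""
    return 0.5 * c * v * v * f

C = 1.0e-9  # farads, illustrative

# Top gear (2000 MHz @ 1.5 V) vs. a lower gear (1200 MHz @ 1.1 V)
p_high = dynamic_power(C, 1.5, 2000e6)
p_low = dynamic_power(C, 1.1, 1200e6)

perf_ratio = 1200 / 2000      # performance drops to 60%
power_ratio = p_low / p_high  # power drops to about 32%

print(f"performance ratio: {perf_ratio:.2f}")  # 0.60
print(f"power ratio:       {power_ratio:.2f}")  # 0.32
```

Power falls roughly twice as fast as performance here, which is the first of the two reasons above.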
5
AMD Athlon-64
x86 ISA 64-bit technology HyperTransport technology – fast memory bus Performance
Slower clock frequency Shorter pipeline (12 vs. 20) SPEC2K results
2 GHz AMD-64 is comparable to a 2.8 GHz P4; the P4 is better on average by 10% (INT) and 30% (FP)
Frequency and voltage scaling 2000 – 800 MHz 1.5 – 1.1 Volts
6
LMBench results
LMBench Benchmarking suite Low-level, micro data
Test each “gear”
Gear  Frequency (MHz)  Voltage (V)
0     2000             1.5
1     1800             1.4
2     1600             1.3
3     1400             1.2
4     1200             1.1
6     800              0.9
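From the gear table, relative dynamic power per gear can be estimated with P ∝ V²f. A small sketch, normalized to gear 0 (gear 5 is left out because the slide omits it):

```python
# Relative dynamic power per gear from the table above, using P ∝ V^2 * f,
# normalized to gear 0. Gear 5 is omitted, matching the slide.

GEARS = {  # gear: (frequency in MHz, voltage in V)
    0: (2000, 1.5),
    1: (1800, 1.4),
    2: (1600, 1.3),
    3: (1400, 1.2),
    4: (1200, 1.1),
    6: (800, 0.9),
}

def relative_power(gear):
    """Power of a gear relative to gear 0, assuming P ∝ V^2 * f."""
    f0, v0 = GEARS[0]
    f, v = GEARS[gear]
    return (v * v * f) / (v0 * v0 * f0)

for g in sorted(GEARS):
    print(f"gear {g}: {relative_power(g):.2f}x gear-0 power")
```

The lowest gear draws roughly 14% of gear-0 dynamic power at 40% of the frequency.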
7
Operations
8
Operating system functions
9
Communication
10
Energy-time tradeoff in HPC
Measure application performance Different than micro benchmarks Different between applications
Look at NAS – standard suite, several HPC applications
Scientific, regular
11
Single node – EP
CPU bound: big time penalty, no (little) energy savings
[Chart annotations (Δtime, Δenergy): +11%, −2%; +45%, +8%; +150%, +52%; +25%, +2%; +66%, +15%]
12
Single node – CG
[Chart annotations (Δtime, Δenergy): +1%, −9%; +10%, −20%]
Not CPU bound: little time penalty, large energy savings
13
Operations per miss
Metric for memory pressure Must be independent of time Uses hardware performance counters
Micro-operations x86 instructions become one or more micro-operations Better measure of CPU activity
Operations per miss (subset of NAS)
Suggestion: Decrease gear as ops/miss decreases
Benchmark:  EP   BT    LU    MG    SP    CG
Ops/miss:   844  79.6  73.5  70.6  49.5  8.60
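The suggestion above (lower the gear as ops/miss falls) can be sketched as a simple policy. The thresholds here are hypothetical, chosen only to separate the benchmarks in the table; a real controller would derive them from measured energy-time profiles.

```python
# Hypothetical gear-selection policy keyed on operations per cache miss.
# Thresholds are illustrative assumptions, not values from the talk.

OPS_PER_MISS = {"EP": 844, "BT": 79.6, "LU": 73.5,
                "MG": 70.6, "SP": 49.5, "CG": 8.60}

def suggest_gear(ops_per_miss):
    """Map memory pressure to a gear (0 = fastest)."""
    if ops_per_miss > 500:  # CPU bound: slowing down mostly costs time
        return 0
    if ops_per_miss > 60:   # moderate memory pressure
        return 1
    if ops_per_miss > 20:
        return 2
    return 4                # memory bound: deep slowdown is nearly free

for bench, opm in OPS_PER_MISS.items():
    print(f"{bench}: gear {suggest_gear(opm)}")
```

Under these assumed thresholds, EP stays in the top gear while CG drops to a deep gear, matching the EP and CG results above.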
14
Single node – LU
[Chart annotations (Δtime, Δenergy): +4%, −8%; +10%, −10%]
Modest memory pressure: gears offer an E-T tradeoff
15
Ops per miss, LU
16
Results – LU
Shift 0/1:  +1%, −6%
Auto shift: +3%, −8%
Gear 1:     +5%, −8%
Gear 2:     +10%, −10%
Shift 1/2:  +1%, −6%
Shift 0/2:  +5%, −8%
17
Bottlenecks
Intra-node Memory Disk
Inter-node Communication Load (im)balance
18
Multiple nodes – EP
S2 = 2.0, S4 = 4.0, S8 = 7.9
Perfect speedup: E constant as N increases
E = 1.02
19
Multiple nodes – LU
S2 = 1.9, E2 = 1.03
S4 = 3.3, E4 = 1.15
S8 = 5.8, E8 = 1.28
Good speedup: E-T tradeoff as N increases
Gear 2: S8 = 5.3, E8 = 1.16
20
Multiple nodes – MG
S2 = 1.2, E2 = 1.41
S4 = 1.6, E4 = 1.99
S8 = 2.7, E8 = 2.29
Poor speedup: increased E as N increases
21
Normalized – MG
With a communication bottleneck, the E-T tradeoff improves as N increases
22
Jacobi iteration
Can increase N, decrease T, and decrease E
23
Future work
We are working on inter-node bottlenecks.
24
Safe overprovisioning
25
The problem: a peak power limit, P
Rack power; room/utility; heat dissipation
Static solution: the number of servers is N = P/Pmax, where Pmax is the maximum power of an individual node
Problem: peak power exceeds average power (Pmax > Paverage); not all power is used – N × (Pmax − Paverage) goes unused; it underperforms – performance is proportional to N; power consumption is not predictable
26
Safe overprovisioning in a cluster: allocate and manage power among M > N nodes
Pick M > N, e.g., M = P/Paverage
Since M × Pmax > P, enforce a per-node limit Plimit = P/M
Goal: use more power, safely under the limit; reduce the power (and peak CPU performance) of individual nodes; increase overall application performance
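As a concrete sketch of the sizing above (all wattages are assumed for illustration, not measurements from the talk): a 10 kW budget, nodes peaking at 250 W but averaging 180 W.

```python
# Sizing sketch for safe overprovisioning. P, PMAX, and PAVG are
# assumed illustrative values.

P = 10_000   # total power budget in watts (assumed)
PMAX = 250   # per-node peak power in watts (assumed)
PAVG = 180   # per-node average power in watts (assumed)

N = P // PMAX      # static allocation: N = P / Pmax
M = P // PAVG      # overprovisioned: M = P / Paverage
P_LIMIT = P / M    # per-node budget, Plimit = P / M

print(f"static N = {N}, overprovisioned M = {M}, Plimit = {P_LIMIT:.1f} W")
```

Here overprovisioning fields 55 nodes instead of 40, each budgeted just above its average draw.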
[Figure: P(t) over time relative to Pmax and Paverage (static allocation), and relative to Pmax, Plimit, and Paverage (overprovisioned)]
27
Safe overprovisioning in a cluster: benefits
Less “unused” power/energy; more efficient power use
More performance under the same power limitation. Let P be per-node performance at full power and P* at the reduced limit; more performance means MP* > NP, i.e., P*/P > N/M, equivalently P*/P > Plimit/Pmax
[Figure: same power-vs-time plots; the region between Pmax and P(t) marks unused energy]
28
When is this a win?
When P*/P > N/M, equivalently P*/P > Plimit/Pmax
In words: when the power reduction exceeds the performance reduction
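A tiny sketch of this win condition (the function name and example numbers are mine, not from the talk):

```python
# Overprovisioning wins when relative per-node performance P*/P
# exceeds N/M, which equals Plimit/Pmax. Names and numbers are
# illustrative.

def overprovisioning_wins(perf_ratio, n, m):
    """True if M throttled nodes outperform N full-power nodes:
    M * P* > N * P  <=>  P*/P > N/M."""
    return perf_ratio > n / m

# 55 throttled nodes vs. 40 full-power nodes; throttling keeps 80%
# of per-node performance: 0.80 > 40/55 ~= 0.73, so it is a win.
print(overprovisioning_wins(0.80, 40, 55))  # True
```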
Two reasons: 1. frequency and voltage scaling; 2. application throughput
[Figure: power (curve 1) and application throughput (curve 2) vs. performance (frequency), divided into regions where P*/P < Paverage/Pmax and where P*/P > Paverage/Pmax]
29
Feedback-directed, adaptive power control
Uses feedback to control power/energy consumption Given power goal Monitor energy consumption Adjust power/performance of CPU Paper: [COLP ’02]
Several policies Average power
Maximum power
Energy efficiency: select slowest gear (g) such that
30
Implementation Components
Two components Integrated into one daemon process
Daemons on each node Broadcasts information at intervals Receives information and calculates Pi for next interval Controls power locally
Research issues
Controlling local power: add a guarantee, a bound on instantaneous power
Interval length – shorter: tighter bound on power, more responsive; longer: less overhead
The function f(L0, …, LM): depends on the relationship between power and performance
Pik: the individual power limit for node i at interval k
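One possible f (the talk deliberately leaves it open) is a load-proportional split of the global budget. This sketch uses assumed names and numbers; it is one candidate policy, not the talk's chosen f.

```python
# A hypothetical allocation function f(L0, ..., LM): split the global
# power budget among nodes in proportion to their reported loads.

def next_power_limits(loads, total_budget):
    """Per-node power limits for the next interval.
    Falls back to a uniform split when every node is idle."""
    total_load = sum(loads)
    if total_load == 0:
        return [total_budget / len(loads)] * len(loads)
    return [total_budget * load / total_load for load in loads]

# Three nodes reporting loads 4, 1, and 3 share a 600 W budget:
limits = next_power_limits([4.0, 1.0, 3.0], 600.0)
print(limits)  # [300.0, 75.0, 225.0]
```

The allocation always sums to the budget, which keeps the instantaneous-power bound intact.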
31
Results – fixed gear
[Chart: energy-time results for fixed gears 0–6]
32
Results – dynamic power control
[Chart: energy-time results under dynamic power control, gears 0–6]
33
Results – dynamic power control (2)
[Chart: additional energy-time results under dynamic power control, gears 0–6]
34
Summary
35
End
36
Summary
Safe over provisioning Deploy M > N nodes More performance
Less “unused” power; more efficient power use
Two autonomic managers – Local: built on prior research; Global: a new, distributed algorithm
Implementation Linux AMD
Contact: Vince Freeh, 513-7196, [email protected]
37
Autoshift
38
Phases
39
Allocate power based on energy efficiency
Allocate power to maximize throughput Maximize number of tasks completed per unit energy Using energy-time profiles
Statically generate a table for each task: tuples of (gear, energy/task)
Modifications Nodes exchange pending tasks Pi determined using table and population of tasks
Benefit Maximizes task throughput
Problems Must avoid starvation
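A minimal sketch of choosing a gear from such a table. The profile numbers are invented for illustration; a real table would come from the statically generated energy-time profiles.

```python
# Pick the gear that minimizes energy per task, i.e. maximizes tasks
# completed per unit energy. The (gear, energy/task) profile below is
# an invented example of a statically generated table.

PROFILE = {  # gear: energy per task in joules (illustrative)
    0: 120.0,
    1: 105.0,
    2: 98.0,
    4: 110.0,  # too slow: fixed overheads dominate again
}

def most_efficient_gear(profile):
    """Gear with the lowest energy per task."""
    return min(profile, key=profile.get)

print(most_efficient_gear(PROFILE))  # 2
```

Note the minimum need not be the slowest gear: past some point, stretching a task raises its energy again.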
40
Memory bandwidth
41
Power management – ICK: need better 1st slide
What Controlling power Achieving desired goal
Why Conserve energy consumption Contain instantaneous power consumption Reduce heat generation Good engineering
42
Related work: Energy conservation
Goal: conserve energy; performance degradation is acceptable
Usually in mobile environments (finite energy source: a battery)
Primary goal: extend battery life
Secondary goal: re-allocate energy, increase the “value” of energy use
Tertiary goal: increase energy efficiency – more tasks per unit energy
Example: feedback-driven energy conservation; control average power usage, Pave = (E0 − Ef)/T
[Figure: battery energy falling from E0 to Ef over time T; power vs. frequency curve]
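The average-power policy above reduces to a one-line computation; the joule and second values here are illustrative:

```python
# Average power over a window: Pave = (E0 - Ef) / T, where E0 and Ef
# are the energy store at the start and end of a window of length T.

def average_power(e_start, e_end, t_seconds):
    """Average power in watts drawn over an interval of t_seconds."""
    return (e_start - e_end) / t_seconds

# 50 kJ drained to 41 kJ over 10 minutes:
pave = average_power(50_000, 41_000, 600)
print(f"Pave = {pave:.1f} W")  # Pave = 15.0 W
```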
43
Related work: Realtime DVS
Goal: reduce energy consumption with no performance degradation
Mechanism: eliminate slack time in the system
Savings: Eidle with frequency scaling; an additional Etask − Etask′ with voltage scaling
[Figure: power vs. time before and after scaling – the task stretches to its deadline, replacing the idle energy Eidle at Pmax with a smaller Etask′]
44
Related work: Fixed installations
Goal: Reduce cost (in heat generation or $) Goal is not to conserve a battery
Mechanisms Scaling
Fine-grain: DVS; coarse-grain: power down
Load balancing
45
Single node – MG
46
Single node – EP
47
Single node – LU
48
Power, energy, heat – oh, my
Relationship: E = P × T, and H ∝ E. Thus: control power
Goal Conserve (reduce) energy consumption Reduce heat generation Regulate instantaneous power consumption
Situations (benefits) Mobile/embedded computing (finite energy store) Desktops (save $) Servers, etc (increase performance)
49
Power usage
CPU power Dominated by dynamic power
System power dominated by CPU Disk Memory
CPU notes: scalable; drives the other system components; a measure of performance
[Figure: power vs. performance (frequency)]
CMOS dynamic power equation: P = ½CV²f
50
Power management in HPC
Goals Reduce heat generation (and $) Increase performance
Mechanisms Scaling Feedback Load balancing
51
Single node – MG
[Chart annotations (Δtime, Δenergy): +6%, −7%; +12%, −8%]
Modest memory pressure:Gears offer E-T tradeoff
52
Power management vs. energy conservation – power management is the mechanism; energy conservation is a policy
Two elements
Energy efficiency – i.e., decrease the energy consumed per task
(Instantaneous) power consumption – i.e., limit the maximum watts used
Power-performance tradeoff Less power & less performance Ultimately energy-time
Power management
2 GHz – 800 MHz
AMD system
6 gears
53
Autonomic managers
Implementation uses two autonomic managers Local – power control Global – power allocation
Local Uses prior research project (new implementation) Requires new policy Daemon process
Reads the power meter; adjusts the processor performance gear (frequency)
Global At regular intervals
Collects appropriate information from all nodes; allocates the power budget for the next quantum
Optimize for one of several objectives
54
Example: Load imbalance
Uniform allocation of power: Pi = Plimit = P/M for each node i; not ideal if nodes are unevenly loaded
Tasks execute more slowly on busy nodes; lightly loaded nodes may not use all their power
Allocate power based on load*: at regular intervals, nodes exchange load information; each computes an individual power limit for the next interval (k)
*Note: load is one of several possible objective functions.
Pik: the individual power limit for node i at interval k
Ensure: Σi Pik ≤ P