The Center for Autonomic Computing is supported by the National Science Foundation under Grant No. 0758596.
China – US Software Workshop, September 2011
Green High Performance Computing Initiative at Rutgers University (GreenHPC@RU)
Manish Parashar, Ivan Rodero
Contact: {parashar, irodero}@cac.rutgers.edu http://nsfcac.rutgers.edu/GreenHPC
Motivations for HPC Power Management
• Growing scale of high-end High Performance Computing systems
• Power consumption has become critical in terms of operational costs (a dominant part of IT budgets)
• Existing power/energy management focuses on single layers
  • May result in suboptimal solutions
Challenges in HPC Power Management
• GreenHPC
  • Defining energy efficiency ($/W/MFLOP)
  • Infrastructure: thermal sensors, instrumentation, monitoring, cooling, etc.
• Architectural challenges
  • Processor, memory, and interconnect technologies
  • Increased use of accelerators
  • Power-aware micro-architectures
• Compilers, OS, and runtime support
  • Power-aware code generation
  • Power-aware scheduling
  • DVFS, programming models, abstractions (see the sketch after this list)
• Considerations for system-level energy efficiency
  • Optimizing the CPU alone is not sufficient; the entire system/cluster must be considered
  • Application/workload-aware optimizations
  • Power-aware algorithm design
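To make the DVFS point concrete, the following is a minimal sketch (not from the original slides) of runtime frequency control through the Linux cpufreq sysfs interface; the sysfs paths, available frequencies, and the "userspace" governor depend on the kernel and driver, and writing them normally requires root privileges.

```python
# Minimal sketch of runtime DVFS control via the Linux cpufreq sysfs interface.
# Assumes the "userspace" governor is available and the caller can write sysfs.
from pathlib import Path

CPUFREQ = "/sys/devices/system/cpu/cpu{cpu}/cpufreq"

def available_frequencies(cpu=0):
    """Return the discrete frequencies (kHz) exposed for a core, lowest first."""
    path = Path(CPUFREQ.format(cpu=cpu)) / "scaling_available_frequencies"
    return sorted(int(f) for f in path.read_text().split())

def set_frequency(khz, cpu=0):
    """Pin a core to a fixed frequency using the userspace governor."""
    base = Path(CPUFREQ.format(cpu=cpu))
    (base / "scaling_governor").write_text("userspace")
    (base / "scaling_setspeed").write_text(str(khz))

# Example policy: drop to the lowest P-state during a memory- or
# communication-bound phase, then restore full speed for compute.
if __name__ == "__main__":
    freqs = available_frequencies(cpu=0)
    set_frequency(freqs[0], cpu=0)   # low frequency during the slack phase
    # ... run the communication/memory-bound phase ...
    set_frequency(freqs[-1], cpu=0)  # back to full speed for compute
```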
Cross-layer Energy-Power Management

Architecture: Component-based Power Management [HiPC'10/11]
• Application-centric aggressive power management at the component level
• Workload profiling and characterization
  • HPC workloads (e.g., HPC Linpack)
• Use of low-power modes to configure subsystems (i.e., CPU, memory, disk, and NIC)
• Energy-efficient memory hierarchies
  • Memory power management for multi-/many-core systems
  • Multiple channels to main memory
  • Application-driven power control (i.e., ensuring bandwidth to main memory by leveraging channel affinity); see the sketch after this list
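As a hypothetical illustration of channel-affinity-driven power control (not code from the slides), the sketch below takes per-channel access counts profiled for one program region, similar to the data behind the "opportunities for powering down" figure, and flags channels whose share of accesses falls below a threshold as candidates for a low-power state.

```python
# Hypothetical sketch: identify memory channels that could enter a low-power
# state for a given program region, based on profiled access counts.
# Threshold and counts are illustrative.

def channel_power_down_candidates(access_counts, threshold=0.01):
    """access_counts: per-channel access counts for one program region.
    Returns channel indices whose share of accesses is below `threshold`,
    i.e., channels the region barely touches and that could be powered down
    (or kept in a self-refresh/low-power mode) during that region."""
    total = sum(access_counts) or 1
    return [ch for ch, count in enumerate(access_counts)
            if count / total < threshold]

# Example: a region that concentrates almost all traffic on channels 0-3.
region_counts = [9500, 9100, 8800, 9300, 12, 5, 0, 3]
print(channel_power_down_candidates(region_counts))  # -> [4, 5, 6, 7]
```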
Runtime Power Management
• Partitioned Global Address Space (PGAS)
  • Implicit message passing
  • Unified Parallel C (UPC) so far (a barrier-wait sketch follows this list)
• Target platforms
  • Many-core (i.e., Intel SCC)
  • HPC clusters
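The idea behind runtime-level power management during collective operations can be sketched as follows (hypothetical code, not the actual UPC runtime instrumentation): while a thread is blocked in a barrier, its core runs at a low frequency and is scaled back up before computation resumes. The `set_frequency` helper is the same cpufreq sketch shown earlier, and `barrier` stands in for the real runtime call (e.g., a `threading.Barrier` in this toy setting).

```python
# Hypothetical sketch: execute a blocking collective wait in a low P-state.
from contextlib import contextmanager
from pathlib import Path

def set_frequency(khz, cpu=0):
    """cpufreq helper as in the earlier DVFS sketch (userspace governor)."""
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq")
    (base / "scaling_governor").write_text("userspace")
    (base / "scaling_setspeed").write_text(str(khz))

LOW_KHZ, HIGH_KHZ = 800_000, 2_400_000   # illustrative P-states

@contextmanager
def low_power_wait(cpu=0):
    """Scale the core down for the duration of a blocking wait, then restore."""
    set_frequency(LOW_KHZ, cpu)
    try:
        yield
    finally:
        set_frequency(HIGH_KHZ, cpu)

def barrier_with_dvfs(barrier, cpu=0):
    # The wait itself gains nothing from a high clock, so run it in the low
    # P-state; compute phases outside this call run at full speed.
    with low_power_wait(cpu):
        barrier.wait()
```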
Energy-aware Autonomic Provisioning [IGCC'10]
• Virtualized Cloud infrastructures with multiple geographically distributed entry points
• Workloads composed of HPC applications
• Distributed Online Clustering (DOC)
  • Clusters job requests in the input stream based on their resource requirements (multiple dimensions, e.g., memory, CPU, and network requirements); a simplified sketch follows this list
• Optimizing energy efficiency by:
  • Powering down subsystems when they are not needed
  • Efficient, just-right VM provisioning (reducing over-provisioning)
  • Efficient proactive provisioning and grouping (reducing re-provisioning)
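The following is a simplified, single-node sketch of the online-clustering step (the actual DOC algorithm is distributed and is described in the IGCC'10 paper): each incoming request, viewed as a point in resource space, joins the nearest existing cluster if it is close enough, otherwise it seeds a new cluster; each resulting cluster then maps to a "just-right" VM class.

```python
# Simplified sketch of online clustering of job requests by resource
# requirements (CPU cores, memory GB, network Mbps). Radius and requests
# are illustrative; the real DOC algorithm is distributed.
import math

class Cluster:
    def __init__(self, point):
        self.centroid = list(point)
        self.size = 1

    def distance(self, point):
        return math.dist(self.centroid, point)

    def absorb(self, point):
        self.size += 1
        # Incremental centroid update.
        self.centroid = [c + (p - c) / self.size
                         for c, p in zip(self.centroid, point)]

def cluster_requests(requests, radius=4.0):
    """Assign each (cpu, mem, net) request to the nearest cluster within
    `radius`, or start a new cluster. Each centroid, rounded up, can define
    a VM class for the jobs in that cluster (reducing over-provisioning)."""
    clusters = []
    for req in requests:
        best = min(clusters, key=lambda c: c.distance(req), default=None)
        if best is not None and best.distance(req) <= radius:
            best.absorb(req)
        else:
            clusters.append(Cluster(req))
    return clusters

if __name__ == "__main__":
    reqs = [(4, 8, 100), (4, 9, 110), (16, 64, 1000), (17, 60, 950)]
    for c in cluster_requests(reqs):
        print(c.size, [round(x, 1) for x in c.centroid])
```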
Energy and Thermal Autonomic Management
• Reactive thermal and energy management of HPC workloads
  • Autonomic decision making to react to thermal hotspots, considering multiple dimensions (i.e., energy and thermal efficiency) and using different techniques: VM migration, CPU DVFS, and CPU pinning (a simplified controller sketch follows this list)
• Proactive energy-aware, application-centric VM allocation for HPC workloads
  • Strategy for proactive VM allocation based on VM consolidation while satisfying QoS guarantees
  • Based on application profiling (profiles are known in advance) and benchmarking on real hardware
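Below is a simplified sketch (not the project's actual framework) of a reactive controller that inspects per-host temperatures and picks among the three actuator types named above: CPU DVFS, CPU re-pinning, or VM migration. Hosts and VMs are plain dictionaries, and the thresholds and ordering of actions are illustrative assumptions.

```python
# Simplified sketch of reactive hotspot handling: prefer the least disruptive
# action (DVFS), then re-pinning, and live migration only for critical heat.
HOT, CRITICAL = 75.0, 85.0   # degrees C, illustrative thresholds

def react_to_hotspot(host, cool_hosts):
    """Return the chosen mitigation for `host` as an (action, detail) pair."""
    temp = max(host["core_temps"])
    if temp < HOT:
        return ("steady", None)
    if temp < CRITICAL:
        if host["freq_khz"] > host["min_freq_khz"]:
            return ("dvfs_down", host["name"])   # cheapest action
        return ("repin_vms", host["name"])       # spread load across cooler cores
    # Critical hotspot: migrate the most demanding VM to the coolest host with room.
    victim = max(host["vms"], key=lambda vm: vm["cpu_load"])
    target = next((h for h in sorted(cool_hosts, key=lambda h: max(h["core_temps"]))
                   if h["free_cores"] >= victim["cores"]), None)
    return ("migrate", (victim["name"], target["name"] if target else None))

# Example: one overheating host and a cooler candidate target.
hot = {"name": "n1", "core_temps": [88, 86], "freq_khz": 2_400_000,
       "min_freq_khz": 1_200_000,
       "vms": [{"name": "vm-a", "cpu_load": 0.9, "cores": 8}]}
cool = [{"name": "n2", "core_temps": [55, 58], "free_cores": 16}]
print(react_to_hotspot(hot, cool))   # -> ('migrate', ('vm-a', 'n2'))
```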
Approach
• Abnormal operational state detection (e.g., poor performance, hotspots); a simple detection sketch follows the diagram below
• Reactive and proactive approaches
  • Reacting to anomalies to return to a steady state
  • Predicting anomalies in order to avoid them
• Workload/application awareness
  • Application profiling/characterization
[Diagram: transition from an abnormal operational state back to a steady state, with each state evaluated along three dimensions: QoS, energy efficiency, and thermal efficiency.]
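As an illustrative (hypothetical) example of abnormal-state detection on a stream of metric samples, such as node temperature or application throughput, the sketch below flags samples that deviate from an exponentially weighted moving-average baseline; a real deployment would combine several metrics and feed the reactive controller sketched earlier.

```python
# Minimal sketch of abnormal-state detection with an EWMA baseline and a
# relative deviation threshold. Parameters and the trace are illustrative.
def detect_abnormal(samples, alpha=0.2, tolerance=0.25):
    """Yield (value, is_abnormal): a sample is abnormal when it deviates from
    the running EWMA baseline by more than `tolerance` (here 25%)."""
    ewma = None
    for x in samples:
        if ewma is None:
            ewma = x
        abnormal = abs(x - ewma) > tolerance * abs(ewma)
        yield x, abnormal
        if not abnormal:          # only steady-state samples update the baseline
            ewma = alpha * x + (1 - alpha) * ewma

# Example: a throughput trace with a sudden drop (the abnormal state).
trace = [100, 102, 98, 101, 99, 60, 58, 100, 101]
print([flag for _, flag in detect_abnormal(trace)])
```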
Goals
• Autonomic (self-monitored and self-managed) computing systems
• Optimizing energy efficiency, cost-effectiveness, and utilization, while ensuring the delivered performance/quality of service
• Addressing both "traditional" and virtualized systems (i.e., data centers and GreenHPC in the Cloud)
[Diagram: cross-layer power management architecture. Each layer (physical environment, resources, virtualization, and application/workload) is instrumented with sensors and actuators driven by an observer/controller pair; the per-layer controllers are coordinated for cross-layer power management, and cross-infrastructure power management spans multiple clouds (private, public, hybrid, etc.) over a virtualized, instrumented, application-aware infrastructure.]
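To illustrate the sensor → observer → controller → actuator loop shown in the diagram, here is a minimal, hypothetical sketch of one layer's autonomic loop; the class and method names are illustrative and are not the project's actual API.

```python
# Hypothetical sketch of the per-layer autonomic loop: a Sensor produces
# measurements, an Observer turns them into state, a Controller decides on
# actions, and an Actuator applies them. The real system runs one such loop
# per layer and coordinates the controllers across layers.
import time

class AutonomicLoop:
    def __init__(self, sensor, observer, controller, actuator, period_s=5.0):
        self.sensor, self.observer = sensor, observer
        self.controller, self.actuator = controller, actuator
        self.period_s = period_s

    def step(self):
        raw = self.sensor.read()                 # e.g., temperatures, power, utilization
        state = self.observer.estimate(raw)      # e.g., detect hotspot / abnormal state
        actions = self.controller.decide(state)  # e.g., DVFS, re-pinning, migration
        for action in actions:
            self.actuator.apply(action)

    def run(self):
        while True:
            self.step()
            time.sleep(self.period_s)
```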
[Figure: percentage of channel accesses per memory channel across program regions, showing opportunities for powering down lightly used channels.]
[Figures: VM classes based on clusters; over-provisioning cost.]
[Figure: duration (s) over time (ns) of calls to the UPC runtime function upcr_wait2 on node 1 during a CG.A.O3.16 execution.]
Example: potential savings during collective operations (e.g., barriers) on the SCC platform.