Software Architecture for Dynamic Thermal Management in Datacenters Tridib Mukherjee Graduate...
Software Architecture for Dynamic Thermal Management in Datacenters
Tridib Mukherjee
Graduate Research Assistant
IMPACT Lab (www.impact.asu.edu)
Department of Comp. Sc. & Engg.
Arizona State University

Outline
Motivation
Dynamic Thermal Management in Datacenters
Thermal-aware task scheduling
Software Architecture
Conclusions and Future work

Motivation
Computing clusters are increasingly deployed in current datacenters limited by power and thermal capacity
• High server density is used to achieve higher computation capability; this leads to high heat density
• Reliability and longevity of overheated servers are affected; system downtime may increase
Rising cost for datacenters
• Large-scale datacenters can run into millions of dollars; cooling cost comprises almost half of this
• The current trend of overcooling based on worst-case thermal characteristics leads to high utility costs
A dynamic thermal-aware control platform is necessary for online thermal evaluation that can achieve a tradeoff between these extremes.

Thermal Management of Datacenter
Motivation and significance
Compute-intensive applications (online gaming, computer movie animation, data mining) require increased utilization of the datacenter
• Maximizing computing capacity is a demanding requirement
New blade servers can be packed more densely
Energy cost is rising dramatically
Goal
• Improving thermal performance
• Lowering hardware failure rate
• Reducing energy cost

Typical layout of a datacenter
[Figure: datacenter layout, annotated with the following temperatures]
• Rack outlet temperature T_out
• Rack inlet temperature T_in
• Air conditioner supply temperature T_s

Schematic View of Thermal Management
[Figure: block diagram of the thermal management control loop; labels recovered from the figure]
• Control; Feedback; Transducer; Target
• Datacenter
• Sensor Data Database: History Sensor Data, Current Sensor Data
• Collecting environmental data and load information from sensors
• Onsite survey
• CFD simulation software
• Abstract Heat Model
• Correlation of load & power; Map load to power consumption
• Cost Analysis
• Other Impact factors
• Policy Controller: Scheduling Policy, Control Policy
• Moab Scheduler
• Incoming task
• Process Migration

Research Issues of Thermal Management in Datacenter
[Figure: research issues grouped under "Understanding" and "Control"; labels recovered from the figure]
• Abstract Heat Flow Model
• Power & Load Characterization
• Modeling Thermal Performance
• Multiscale & Multimodal Info Analysis
• Thermal Performance Evaluation
• Cost Optimization
• Scheduler
• Other Impact Factors

Task scheduling and Thermal Distribution Correlation
Scheduling Requirements
• Real-time measurement
• Online lightweight temperature prediction
• Thermal-awareness in the scheduling decisions
Reaction Chain
• Task Assignment -> Power Consumption Distribution -> Temperature Distribution -> Energy Cost
[Figure labels: Task Assignment; Power Consumption Distribution; Inlet temperature distribution without cooling; 25C; with cooling, inlet temperature lowered below redline threshold; Demand for cooling load / energy]
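
The last link of the reaction chain (inlet temperatures driving cooling demand) can be illustrated with a minimal sketch. It assumes, as a simplification not stated on the slide, that lowering the supply air temperature by some amount lowers every inlet temperature by the same amount. The function name, the sample inlet temperatures, and the use of 25 C as the redline are all hypothetical choices for illustration.

```python
def required_supply_drop(inlet_temps_c, redline_c):
    """Return how many degrees C the supply air must be lowered so that
    the hottest inlet sits at or below the redline.

    Assumption (not from the slides): a drop in supply temperature
    shifts every inlet temperature down by the same amount.
    """
    hottest = max(inlet_temps_c)
    return max(0.0, hottest - redline_c)


# Hypothetical inlet temperatures produced by some task assignment.
inlets = [22.0, 24.5, 27.0, 23.5]
REDLINE_C = 25.0  # assumed redline threshold, for illustration only

drop = required_supply_drop(inlets, REDLINE_C)
cooled = [t - drop for t in inlets]
print(f"lower supply air by {drop:.1f} C -> inlets {cooled}")
```

Under this simplification, the hottest inlet alone sets the cooling demand, which is why a task assignment that flattens the inlet temperature distribution can reduce the energy cost.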

Thermal-aware scheduling Techniques
Uniform Task distribution (UT)
• Assigns all chassis the same amount of tasks (power consumption)
Uniform Outlet Profile (UOP)
• Assigns tasks so as to balance outlet temperatures (uniform distribution)
Minimum Computing Energy (coolest inlet) (MCE)
• Assigns tasks so as to keep the number of active (powered-on) chassis as small as possible
Recirculation Minimized Scheduling (XInt)
• Uses a profiling process to calculate cross-interference coefficients
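
Two of the techniques above are simple enough to sketch directly. The following is an illustrative reading of UT and MCE, not the authors' implementation: the function names, the per-chassis capacity of 10 tasks, and the sample inlet temperatures are assumptions. MCE is interpreted here as filling the chassis with the coolest inlets first, so that as few chassis as possible are active.

```python
def uniform_task(num_tasks, num_chassis):
    """UT: spread tasks as evenly as possible over all chassis."""
    base, extra = divmod(num_tasks, num_chassis)
    return [base + (1 if i < extra else 0) for i in range(num_chassis)]


def coolest_inlet_first(num_tasks, inlet_temps_c, capacity):
    """MCE (one reading): fill the coolest-inlet chassis to capacity
    before touching the next one, keeping active chassis to a minimum."""
    order = sorted(range(len(inlet_temps_c)), key=lambda i: inlet_temps_c[i])
    assignment = [0] * len(inlet_temps_c)
    remaining = num_tasks
    for i in order:
        if remaining == 0:
            break
        take = min(capacity, remaining)
        assignment[i] = take
        remaining -= take
    if remaining:
        raise ValueError("not enough capacity for all tasks")
    return assignment


# Hypothetical workload: 23 tasks on 5 chassis, 10 task slots per chassis.
inlet_temps = [24.0, 21.0, 26.0, 22.0, 23.0]
ut = uniform_task(23, 5)
mce = coolest_inlet_first(23, inlet_temps, capacity=10)
print("UT :", ut)
print("MCE:", mce, "active chassis:", sum(1 for n in mce if n))
```

UT keeps every chassis powered, while the MCE reading leaves two of the five chassis idle in this example. UOP and XInt need a thermal model of the room (outlet temperatures and cross-interference coefficients respectively) and are not sketched here.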

Total Energy Cost Comparisons

System Model & Cluster Set-up
[Figure: Saguaro Cluster layout. Labels: Network; Remote Client; Server saguaro.fulton.asu.edu; Intel 64-bit Xeon EM64T Dual-processor Servers; Rack 0 to Rack 3; Chassis 0 to Chassis 4 in each rack; server icons labeled PowerEdge 2450]
Saguaro Cluster is the main cluster maintained by the High Performance Computing Initiative at ASU.
• 4 racks, 5 chassis per rack, 10 dual-processors per chassis
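
The cluster dimensions above can be tallied with a short sketch. It reads "10 dual-processors per chassis" as 10 dual-processor servers per chassis, which is an interpretation rather than something the slide states outright; the constant names and the (rack, chassis) indexing are chosen here for illustration and follow the Rack 0 to Rack 3 and Chassis 0 to Chassis 4 labels in the figure.

```python
RACKS = 4
CHASSIS_PER_RACK = 5
NODES_PER_CHASSIS = 10   # assumed: "dual-processors" means dual-processor servers
CPUS_PER_NODE = 2

# Enumerate chassis as (rack, chassis) pairs: (0, 0) ... (3, 4).
chassis_ids = [(r, c) for r in range(RACKS) for c in range(CHASSIS_PER_RACK)]

total_chassis = len(chassis_ids)
total_nodes = total_chassis * NODES_PER_CHASSIS
total_cpus = total_nodes * CPUS_PER_NODE
print(total_chassis, "chassis,", total_nodes, "servers,", total_cpus, "processors")
```

Under that reading the scheduler has 20 chassis to choose between, which is the granularity at which the thermal-aware techniques assign tasks.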

Cluster Management S/W Infrastructure
We used the Moab scheduler for job allocation in this cluster.
• Easy to use
• Provides a good graphical interface in the form of Moab Cluster Manager (MCM).
• Job re-allocation is allowed based on priority
• Makes use of the underlying resource management software (such as TORQUE) and enforces the scheduling policies (such as fair-share) selected from the GUI
Thermal awareness is integrated into the Moab Scheduler.
• Priority is set as a function of temperature, utilization, etc.
PHP-based datacenter visualization.
[Figure: software stack, top to bottom: Moab Cluster Management GUI; Moab Server; Resource Management (Torque); Data Center]
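
The slide says only that priority is a function of temperature, utilization, and other factors; it does not give the function. As a hedged sketch, one plausible form is a weighted sum in which cooler and less utilized nodes rank higher. The function name, weights, and sample node data below are hypothetical and are not taken from Moab or from the authors' configuration.

```python
def node_priority(inlet_temp_c, utilization, w_temp=1.0, w_util=10.0):
    """Return a priority score; higher means the node is preferred.

    Assumed form: a negated weighted sum, so cooler inlets and lower
    utilization (a fraction in [0, 1]) both raise the priority.
    """
    return -(w_temp * inlet_temp_c + w_util * utilization)


# Hypothetical nodes: name -> (inlet temperature in C, utilization).
nodes = {
    "n0": (24.0, 0.5),
    "n1": (21.0, 0.9),
    "n2": (22.0, 0.1),
}

ranked = sorted(nodes, key=lambda n: node_priority(*nodes[n]), reverse=True)
print(ranked)
```

With these weights the coolest node (n1) still ranks last because it is almost fully utilized, which shows why temperature alone is not enough to pick a target node.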

Chassis Level Sensor Data Collection
SNMP-based script periodically queries the sensors and updates the server database
PHP script periodically accesses the database to present the thermal history on the webpage
Sensor Placement at each chassis*
• 11 outlet temperature sensors at the back of the chassis
• 3 housing temperature sensors at the middle of the chassis
* There is only one inlet sensor at the front of the chassis
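
The collect-then-present flow described above can be sketched as follows. This is an illustration in Python rather than the SNMP and PHP scripts the slide refers to: `read_chassis_sensors` is a hypothetical stand-in that returns fixed values instead of issuing SNMP queries, the table layout and chassis names are assumptions, and an in-memory SQLite database stands in for the server database. Only the sensor counts (1 inlet, 3 housing, 11 outlet) come from the slide.

```python
import sqlite3
import time


def read_chassis_sensors(chassis_id):
    """Hypothetical stand-in for an SNMP query of one chassis."""
    return {"inlet": [21.5], "housing": [30.0] * 3, "outlet": [35.0] * 11}


def init_db(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(ts REAL, chassis TEXT, kind TEXT, idx INTEGER, temp_c REAL)"
    )


def poll_once(conn, chassis_ids, reader=read_chassis_sensors, now=None):
    """Query every chassis once and append the readings to the database."""
    ts = time.time() if now is None else now
    rows = [
        (ts, cid, kind, i, temp)
        for cid in chassis_ids
        for kind, temps in reader(cid).items()
        for i, temp in enumerate(temps)
    ]
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def thermal_history(conn, chassis_id, kind):
    """Average temperature per poll, oldest first (what a web page would plot)."""
    cur = conn.execute(
        "SELECT ts, AVG(temp_c) FROM readings "
        "WHERE chassis = ? AND kind = ? GROUP BY ts ORDER BY ts",
        (chassis_id, kind),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
init_db(conn)
inserted = poll_once(conn, ["rack0-chassis0", "rack0-chassis1"], now=0.0)
history = thermal_history(conn, "rack0-chassis0", "outlet")
print(inserted, history)
```

In a periodic deployment, `poll_once` would be called from a timer or cron-style loop, and the presentation side would call something like `thermal_history` on each page load.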

Visualization and Scheduler Integration
Temperature data is included as Generic Metric (GMETRIC) in Moab.
Node priority is set based on Moab GMETRIC data.

Putting it all together: Software Architecture
[Figure: layered software architecture; labels recovered from the figure]
Presentation
• MCM
• Server history in web page
• PHP Script
• Local Desktop 129.219.33.232
Scheduling Control
• Moab Scheduler
• Moab GMETRIC Data Provider
• TORQUE Resource Manager Server
• History of Sensor Reading
• Nagios Script
• Remote Client
• Server saguaro.fulton.asu.edu
Datacenter Servers
• TORQUE Resource Manager Client
• Intel 64-bit Xeon EM64T Dual-processor Servers
• Access data from the chassis level sensors

Modularized Implementation of Thermal Awareness in Task Scheduling

Conclusions
Proposed Architecture
• enables dynamic on-line thermal management during datacenter operation.
• provides visualization of thermal distribution
Implemented in fully operational ASU datacenter.
Prototype development and demonstration at the Research @ Intel day.
Questions ??