Lesson learned after our recent cooling problem
Michele Onofri, Stefano Zani, Andrea Chierici
HEPiX Spring 2014
Outline
• INFN-T1 on-call procedure
• Incident
• Recovery procedure
• What we learned
• Conclusions
21/05/2013
INFN-T1 on-call procedure
On-call service
• CNAF staff are on-call on a weekly basis
  – 2/3 times per year
  – must live within 30 minutes of CNAF
  – service phone receives alarm SMSes
  – periodic training on security and intervention procedures
• 3 incidents in the last three years
  – only this last one required the site to be powered off completely
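The alarm-SMS flow the slide describes (sensors trip, the service phone gets paged) can be sketched as a simple threshold check. This is a minimal illustration, not the actual CNAF setup: the 27 °C limit, sensor names, and the notification hook are all assumptions.

```python
# Sketch of a threshold-based alarm check that pages the on-call
# service phone. The temperature limit, sensor names and the
# send_sms hook are illustrative assumptions.

TEMP_LIMIT_C = 27.0  # hypothetical high-temperature threshold


def check_sensors(readings, limit=TEMP_LIMIT_C):
    """Return an alarm message for every sensor over the limit."""
    return [
        f"ALARM {name}: {temp:.1f} C > {limit:.1f} C"
        for name, temp in sorted(readings.items())
        if temp > limit
    ]


def notify_on_call(messages, send_sms=print):
    """Forward each alarm to the on-call phone (stubbed with print)."""
    for msg in messages:
        send_sms(msg)


if __name__ == "__main__":
    readings = {"room-A": 24.5, "room-B": 31.2}
    notify_on_call(check_sensors(readings))
```

In a real deployment the `send_sms` stub would call an SMS gateway; keeping it injectable makes the alarm logic itself trivially testable.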
Service Dashboard
Incident
What happened on the 9th of March
• 1.08am: fire alarm
  – on-call person intervenes and calls the firefighters
• 2.45am: fire extinguished
• 3.18am: high-temperature warning
  – air conditioning blocked
  – on-call person calls for help
• 4.40am: decision is taken to shut down the center
• 12.00pm: chiller under maintenance
• 5.00pm: chiller fixed, center can be turned back on
• 9.00pm: farm back on-line, waiting for storage
10th of March
• 9.00am: support call to switch storage back on
• 6.00pm: center open again for the LHC experiments
• Next day: center fully open again
Chiller power supply
Incident representation
[Diagram: six chillers under one control-system head; chillers 1–5 share control-logic power supply "Ctrl sys Pow 1", the sixth is on "Ctrl sys Pow 2"]
Incident examination
• 6 chillers serve the computing room
• 5 share the same power supply for the control logic (we did not know that!)
• Fire in one of the control-logic units cut power to 5 chillers out of 6
  – 1 chiller was still working and we weren’t aware of that!
  – Could we have avoided turning the whole center off? Probably not! But a controlled shutdown could have been done.
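The blind spot above (one chiller still running, unnoticed) could be closed with a liveness check on each chiller's electric power draw, which the facility monitoring app already records. A minimal sketch follows; the 5 kW idle threshold and the sample readings are illustrative assumptions.

```python
# Sketch: infer which chillers are still alive from their electric
# power draw and advise on the kind of shutdown needed. The 5 kW
# idle threshold and the readings are illustrative assumptions.

IDLE_THRESHOLD_KW = 5.0  # below this we assume the chiller is down


def alive_chillers(power_kw, threshold=IDLE_THRESHOLD_KW):
    """Return the sorted names of chillers whose draw suggests they run."""
    return sorted(name for name, kw in power_kw.items() if kw > threshold)


def shutdown_advice(power_kw):
    """Suggest a controlled shutdown while any chiller survives,
    an emergency power-off once none is left."""
    alive = alive_chillers(power_kw)
    if alive:
        return (f"controlled shutdown possible "
                f"({len(alive)}/{len(power_kw)} chillers alive)")
    return "no cooling left: emergency power-off"


if __name__ == "__main__":
    readings = {f"chiller-{i}": 0.3 for i in range(1, 6)}
    readings["chiller-6"] = 42.0  # the one that kept running that night
    print(shutdown_advice(readings))
```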
Facility monitoring app
Chiller n.4
[Chart legend] Black: electric power in (kW); Blue: water temp. in (°C); Yellow: water temp. out (°C); Cyan: chiller room temp. (°C)
Incident seen from inside
Incident seen from outside
Recovery Procedure
Recovery procedure
• Facility: support call for an emergency intervention on the chillers
  – recovered the burned bus and control logic n.4
• Storage: support call
• Farming: took the chance to apply all security patches and the latest kernel to the nodes
  – switch-on order: LSF server, CEs, UIs
  – for a moment we were thinking about upgrading to LSF 9
Failures (1)
• Old WNs
  – BIOS battery exhausted, configuration reset
    • PXE boot, hyper-threading, disk configuration (AHCI)
  – lost IPMI configuration (30% broken)
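Reapplying a lost IPMI LAN configuration is typically done with `ipmitool lan set`. The sketch below only builds the command lines for one node; channel 1 and the sample addresses are illustrative assumptions, not the CNAF values.

```python
# Sketch: rebuild the ipmitool invocations needed to restore a node's
# lost static IPMI LAN configuration. Channel 1 and the example
# addresses are illustrative assumptions.


def ipmi_restore_cmds(ip, netmask, gateway, channel=1):
    """Return the ipmitool command lines that reapply a static LAN config."""
    base = ["ipmitool", "lan", "set", str(channel)]
    return [
        base + ["ipsrc", "static"],
        base + ["ipaddr", ip],
        base + ["netmask", netmask],
        base + ["defgw", "ipaddr", gateway],
    ]


if __name__ == "__main__":
    for cmd in ipmi_restore_cmds("192.0.2.10", "255.255.255.0", "192.0.2.1"):
        print(" ".join(cmd))
        # on a real node: run each command via subprocess with check=True
```

Generating the commands separately from executing them makes the procedure easy to review (or loop over a node inventory) before anything touches a BMC.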
Failures (2)
• Some storage controllers were replaced
• 1% of PCI cards (mainly 10 Gbit network) were replaced
• Disks, power supplies and network switches were almost entirely undamaged
What we learned
We fixed our weak point
[Diagram: six chillers under one control-system head, now each with its own control-logic power supply (Ctrl sys Pow 1–6)]
We lack an emergency button
• Shutting the center down is not easy: a real “emergency shutdown” procedure is missing
  – We could have avoided switching off the whole center if we had had more control
  – Depending on the incident, some services may be left on-line
• The person on-call can’t know all the site details
Hosted services
• Our computing room hosts services and nodes outside our direct supervision, over which it’s difficult to have full control
  – We need an emergency procedure for those too
  – We need a better understanding of the SLAs
Conclusions
We benchmarked ourselves
• It took 2 days to get the center back on-line
  – less than one to open the LHC experiments
  – everyone was aware of what to do
  – all working nodes rebooted with a solid configuration
  – a few nodes were reinstalled and put back on-line in a few minutes
Lesson learned
• We must have clearer evidence of which chiller is working at any moment (the on-call person does not have it right now)
  – The new dashboard appears to be the right place
• We created a task force to implement a controlled shutdown procedure
  – Establish a shutdown order
    • WNs should be switched off first, then disk-servers, grid and non-grid services, bastions, and finally network switches
• In case of emergency, the on-call person is required to take a difficult decision
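The shutdown order above (WNs, then disk-servers, grid and non-grid services, bastions, network switches) can be encoded as an ordered list of tiers. A minimal sketch follows; the per-tier host names, the `power_off` hook, and the dry-run flag are assumptions, not the task force's actual implementation.

```python
# Sketch of the controlled-shutdown order from the slide. The host
# names per tier and the power_off hook are illustrative assumptions.

SHUTDOWN_TIERS = [
    ("worker nodes", ["wn-001", "wn-002"]),
    ("disk-servers", ["ds-01"]),
    ("grid and non-grid services", ["ce-01", "ui-01"]),
    ("bastions", ["bastion-01"]),
    ("network switches", ["sw-core-01"]),
]


def shutdown_plan(tiers=SHUTDOWN_TIERS):
    """Flatten the tiers into the ordered list of hosts to power off."""
    return [host for _tier, hosts in tiers for host in hosts]


def run_shutdown(tiers=SHUTDOWN_TIERS, dry_run=True, power_off=print):
    """Walk the tiers in order; with dry_run=True only announce the plan,
    which also gives a cheap way to rehearse the procedure."""
    for tier, hosts in tiers:
        for host in hosts:
            if dry_run:
                print(f"[dry-run] would power off {host} ({tier})")
            else:
                power_off(host)


if __name__ == "__main__":
    run_shutdown()  # dry run: prints the plan without touching anything
```

A dry-run mode like this is one cheap partial answer to the "how do we test the procedure?" question on the next slide: it exercises ordering and inventory without powering anything off.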
Testing shutdown procedure
• The shutdown procedure we are implementing can’t be easily tested
• How to perform a “simulation”?
  – It doesn’t sound right to switch the center off just to prove we can do it safely
• How do other sites address this?
• Should periodic BIOS battery replacements be scheduled?