Tier1 Report HEPSysMan @ Cambridge 23rd October 2006 Martin Bly.
23 October 2006 HEPSysMan @ Cambridge
RAL Tier-1
• RAL hosts the UK WLCG Tier-1
  – Funded via the GridPP2 project from PPARC
  – Supports WLCG and UK Particle Physics users and collaborators
• VOs:
  – LHC: Atlas, CMS, LHCb, Alice (dteam, ops)
  – Babar, CDF, D0, H1, Zeus
  – bio, cedar, esr, fusion, geant4, ilc, magic, minos, pheno, t2k, …
• Other experiments:
  – Mice, SNO, UKQCD
• Theory users
• …
Staff / Finance
• Bid to PPARC for 'GridPP3' project
  – For the exploitation phase of LHC
  – September 2007 to March 2011
  – Increase in staff and hardware resources
  – Result expected early 2007
• Tier-1 is recruiting
  – 2 x systems admins, 1 x hardware technician
  – 1 x grid deployment
  – Replacement for Steve Traylen to head the grid deployment and user support group
• CCLRC internal reorganisation into Business Units
  – The Tier-1 service is run by the E-Science department, which is now part of the Facilities Business Unit (FBU)
New building
• Funding approved for a new computer centre building
  – 3 floors: computer rooms on the ground floor, offices above
  – 240 m2 low power density room
    • Tape robots, disk servers, etc.
    • Minimum heat density 1.0 kW/m2, rising to 1.6 kW/m2 by 2012
  – 490 m2 high power density room
    • Servers, CPU farms, HPC clusters
    • Minimum heat density 1.8 kW/m2, rising to 2.8 kW/m2 by 2012
  – UPS computer room
    • 8 racks + 3 telecoms racks
    • UPS system to provide continuous power of 400 A / 92 kVA three-phase for equipment, plus power to air conditioning (total approx. 800 A / 184 kVA)
  – Overall
    • Space for 300 racks (+ robots, telecoms)
    • Power: 2700 kVA initially, max 5000 kVA by 2012 (inc. air-con)
    • UPS capacity to meet an estimated 1000 A / 250 kVA for 15-20 minutes for specific hardware, allowing clean shutdown and survival of short breaks
  – Shared with HPC and other CCLRC computing facilities
  – Planned to be ready by summer 2008
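As a sanity check, the quoted ampere and kVA pairs for the UPS room are mutually consistent if one assumes the UK nominal supply voltage of 230 V per phase; the voltage is my assumption, not stated on the slide:

```python
# Cross-check the quoted current/power pairs for the UPS computer room.
# The 230 V supply voltage is an assumption, not given on the slide.
volts = 230

equipment_kva = 400 * volts / 1000   # equipment feed: 400 A
total_kva = 800 * volts / 1000       # equipment plus air-con: 800 A

print(equipment_kva)  # 92.0
print(total_kva)      # 184.0
```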
Hardware changes
• FY05/06 capacity procurement, March 06
  – 52 x 1U twin dual-core AMD 270 units
    • Tyan 2882 motherboard
    • 4GB RAM, 250GB SATA HDD, dual 1Gb NIC
    • 208 job slots, 200 kSI2K
    • Commissioned May 06, running well
  – 21 x 5U 24-bay disk servers
    • 168TB (210TB) data capacity
    • Areca 1170 PCI-X 24-port controller
    • 22 x 400GB (500GB) SATA data drives, RAID 6
    • 2 x 250GB SATA system drives, RAID 1
    • 4GB RAM, dual 1Gb NIC
    • Commissioning delayed (more…)
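The 168TB (210TB) figure follows from the drive counts: RAID 6 gives up two drives' worth of capacity per 22-drive array, and the bracketed number reflects 500GB rather than 400GB data drives. A quick check:

```python
# Usable capacity of the March 06 disk servers.
# RAID 6 keeps two drives' worth of parity per array.
servers = 21
data_drives = 22
parity = 2
usable_per_server = data_drives - parity   # 20 drives of usable space

gb_400 = servers * usable_per_server * 400   # 400GB drives as shipped
gb_500 = servers * usable_per_server * 500   # 500GB drives after the swap

print(gb_400 // 1000, "TB")  # 168 TB
print(gb_500 // 1000, "TB")  # 210 TB
```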
Hardware changes (2)
• FY06/07 capacity procurements
  – 47 x 3U 16-bay disk servers: 282TB data capacity
    • 3Ware 9550SX-16ML PCI-X 16-port SATA RAID controller
    • 14 x 500GB SATA data drives, RAID 5
    • 2 x 250GB SATA system drives, RAID 1
    • Twin dual-core Opteron 275 CPUs, 4GB RAM, dual 1Gb NIC
    • Delivery expected October 06
  – 64 x 1U twin dual-core Intel Woodcrest 5130 units (550 kSI2K)
    • 4GB RAM, 250GB SATA HDD, dual 1Gb NIC
    • Delivery expected November 06
• Upcoming in FY06/07:
  – Further 210TB disk capacity expected December 06
    • Same spec as above
  – High Availability systems with UPS
    • Redundant PSUs, hot-swap paired HDDs, etc.
  – AFS replacement
  – Enhancement to Oracle services (disk arrays or RAC servers)
Hardware changes (3)
• SL8500 tape robot
  – Expanded from 6,000 to 10,000 slots
  – 10 drives shared between all users of the service
  – Additional 3 x T10K tape drives for PP
  – More when the CASTOR service is working
• STK Powderhorn
  – Decommissioned and removed
Storage commissioning
• Problems with the March 06 procurement: WD4000YR drives on Areca 1170, RAID 6
  – Many instances of multiple drive dropouts
  – Unwarranted drive dropouts followed by re-integration of the same drive
  – Drive electronics (ASIC) on the 4000YR (400GB) units changed with no change of model designation
    • We got the updated units
  – Firmware updates to the Areca cards did not solve the issues
  – WD5000YS (500GB) units swapped in by WD
    • Fixes most issues, but…
  – Status data and logs from the drives show several additional problems
    • Testing under high load to gather statistics
  – Production further delayed
Air-con issues
• Setup
  – 13 x 80kW units in the lower machine room; several paired units work together
• Several 'hot' days (for the UK) in July
  – Sunday: dumped ~70 jobs
    • Alarm system failed to notify operators
    • Pre-emptive automatic shutdown not triggered
    • Ambient air temperature reached >35C, machine exhaust temperature >50C!
    • HPC services not so lucky
  – Mid-week 1: problems over two days
    • Attempts to cut load by suspending batch services to protect data services
    • Forced to dump 270 jobs
  – Mid-week 2: two hot days predicted
    • Pre-emptive shutdown of batch services in the lower machine room
    • No jobs lost; data services remained available
• Problem
  – High ambient air temperature tripped high-pressure cut-outs in the refrigerant gas circuits
  – Cascade failure as individual air-con units work harder
  – Loss of control of machine room temperature
• Solutions
  – Sprinklers under units
    • Successful, but banned due to Health and Safety concerns
  – Up-rated refrigerant gas pressure settings to cope with higher ambient air temperature
Operating systems
• Grid services, batch workers, service machines
  – SL3 (mainly 3.0.3, 3.0.5) and 4.2, all ix86
  – SL4 before Xmas
    • Considering x86_64
• Disk storage
  – SL4 migration in progress
• Tape systems
  – AIX: caches
  – Solaris: controller
  – SL3/4: CASTOR systems, newer caches
• Oracle systems
  – RHEL3/4
• Batch system
  – Torque/MAUI
    • Fair-shares, allocation by the User Board
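Fair-share scheduling in MAUI is driven by a handful of parameters in maui.cfg. The sketch below shows the general shape of such a setup; the VO names and targets are purely illustrative, not the Tier-1's actual User Board allocations:

```
# maui.cfg (illustrative fair-share fragment, not RAL's actual values)
FSPOLICY        DEDICATEDPS    # charge usage as dedicated processor-seconds
FSDEPTH         7              # keep seven fair-share windows of history
FSINTERVAL      24:00:00       # each window covers one day
FSDECAY         0.80           # older windows count for progressively less

# Per-VO usage targets, as a percentage of the farm (invented numbers)
GROUPCFG[atlas] FSTARGET=25
GROUPCFG[cms]   FSTARGET=20
GROUPCFG[lhcb]  FSTARGET=15
```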
Databases
• 3D project
  – Participating since the early days
    • Single Oracle server for testing
    • Successful
  – Production service
    • 2 x Oracle RAC clusters
      – Two servers per RAC
        » Redundant PSUs, hot-swap RAID 1 system drives
      – Single SATA/FC data array
      – Some transfer rate issues
      – UPS to come
Storage Resource Management
• dCache
  – Performance issues
    • LAN performance very good
    • WAN performance and tuning problems
  – Stability issues
  – Now better:
    • Increased the number of open file descriptors
    • Increased the number of logins allowed
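Raising the open file descriptor limit for a long-running service like dCache is usually done via the per-process and system-wide limits; the fragment below is a generic sketch of that technique, with the account name, paths, and values all assumptions rather than the exact change made at RAL:

```
# /etc/security/limits.conf — per-process open-file cap for the
# service account (name and values illustrative)
dcache  soft  nofile  16384
dcache  hard  nofile  65536

# /etc/sysctl.conf — system-wide file handle ceiling
fs.file-max = 262144
```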
• ADS
  – In-house system, many years old
  – Will remain for some legacy services
• CASTOR2
  – Replaces both dCache disk and tape SRMs for the major data services
  – Replaces T1 access to the existing ADS services
  – Pre-production service for CMS
  – LSF for transfer scheduling
Monitoring
• Nagios
  – Production service implemented
  – 3 servers (1 master + 2 slaves)
  – Almost all systems covered (600+)
  – Replacing SURE
  – Add call-out facilities
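For reference, Nagios checks of this kind are declared as object definitions; the host, address, and command names below are invented for illustration, not RAL's actual configuration:

```
# Sketch of Nagios object definitions (all names illustrative)
define host {
    use        generic-host
    host_name  disk-server-01
    address    192.168.10.21
}

define service {
    use                  generic-service
    host_name            disk-server-01
    service_description  Disk space
    check_command        check_nrpe!check_disk
    notification_period  24x7
}
```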
Networking
• All systems have 1Gb/s connections
  – Except the oldest fraction of the batch farm
• 10Gb/s links almost everywhere
  – 10Gb/s backbone within the Tier-1
    • Complete November 06
    • Nortel 5530/5510 stacks
  – 10Gb/s link to the RAL site backbone
    • 10Gb/s backbone links at RAL expected end November 06
    • 10Gb/s link to RAL Tier-2
  – 10Gb/s link to the UK academic network SuperJanet5 (SJ5)
    • Expected in production by end of November 06
    • Firewall still an issue
      – Planned bypass for Tier-1 data traffic as part of the RAL<->SJ5 and RAL backbone connectivity developments
  – 10Gb/s OPN link to CERN active
    • Since September 06
    • Using a pre-production SJ5 circuit
    • Production status at SJ5 handover
Security
• Notified of an intrusion at Imperial College London
• Searched logs
  – Unauthorised use of an account from the suspect source
  – Evidence of harvesting password maps
  – No attempt to conceal activity
  – Unauthorised access to other sites
  – No evidence of root compromise
• Notified the sites concerned
  – Incident widespread
• Passwords changed
  – All inactive accounts disabled
• Cleanup
  – Changed NIS to use a shadow password map
  – Reinstalled all interactive systems
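On a Linux NIS master, serving a separate shadow map rather than merging password hashes into the world-readable passwd map comes down to a couple of settings in /var/yp/Makefile; this is a sketch of the general technique, not RAL's exact change:

```
# /var/yp/Makefile (excerpt, illustrative)
# Do not fold encrypted passwords into the passwd map...
MERGE_PASSWD=false

# ...and include shadow among the maps that get built
all: passwd shadow group hosts rpc services netid
```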