RAL Tier1/A Site Report
Transcript of RAL Tier1/A Site Report
Martin Bly
HEPiX - Brookhaven National Laboratory
18-20 October 2004
Overview
• Introduction
• Hardware
• Software
• Security
RAL Tier1/A
• RAL is the Tier 1 centre in the UK
  – Supports all VOs but gives priority to ATLAS, CMS, LHCb
  – LCG core site
• Babar collaboration Tier A
• Support for other experiments:
  – D0, H1, SNO, UKQCD, MINOS, Zeus, Theory, …
• Various test environments for grid projects
Pre-Grid Upgrade
[Chart: RAL Linux CSF weekly CPU utilisation, Financial Year 2000/01 – platform-related CPU hours, April 2000 to March 2001, with markers at 1 July 2000 and 1 October 2000]
Post-GRID Upgrade
GRID Load 21-28 July
Full again in 8 hours!
LCG in Production
• Since June the Tier1 LCG service has evolved into a full-scale production facility
  – Sort of sneaked up on us! Gradual change from a test/development environment to full-scale production.
  – Availability and reliability of the LCG service are now a high priority for RAL staff.
  – Now the largest single CPU resource at RAL
GRID Production
Hardware
• Main farms: 884 CPUs, approx 880 kSI2K
  – 312 CPUs x P3 @ 1.4GHz
  – 160 CPUs x P4/Xeon @ 2.66GHz, HT off
  – 512 CPUs x P4/Xeon @ 2.8GHz, HT off
• Disk: approx 226TB
  – 52 x 800GB R5 IDE/SCSI arrays
  – 22 x 2TB R5 IDE/SCSI arrays
  – 40 x 4TB R5 EonStor SATA/SCSI arrays
• Tape:
  – 6000-slot Powderhorn silo, 200GB/tape, 8 drives
• Misc:
  – SUN disk servers; AIX (AFS cell)
  – 140 CPUs x P3 @ 1GHz
Hardware Issues
• CPU and disks delivered June 16
• CPU units:
  – 6 in 256 failed under testing – memory, motherboard
  – Installed into production after ~4 weeks
• Disk systems:
  – Riser cards failing; looks to be the batch
  – Issues with EonStor firmware – fixes from vendor
  – Into production about now
Enhancements
• FY 2004/05 CPU/disk procurement starting shortly
  – Expect lower volume of CPU and disk
  – CPU technology: Xeon/Opteron
  – Disk technology: SATA/SCSI, SATA/FC, …
• Sun systems services and data migrating to SL3
  – Mail, NIS -> SL3
  – Data -> RH7.3, SL3
  – Due Xmas ’04
• AFS cell migration to SL3/OpenAFS
• Investigating SANs, iSCSI, SAS
Environment
• Farms dispersed over three machine rooms
• Extra temporary air-conditioning capacity for summer
  – Actually survived with it mostly idle!
• New air conditioning for lower machine room (A5L), independent of the main building air-con system. 5 units, 400kW; arrives November
• Extra power distribution (but not new power)
• All new rack kit to be located in A5L, shared with other high-availability services (HPC etc.)
• Issues:
  – New Nocona chips use more power – and create more heat
  – Rack weight on raised floors – latest kit is around 8 tonnes
  – Air-con unit weight + power
Network
• Site link – 2.5Gb/s to TVN
• Site backbone @ 1Gb/s
• Tier1/A backbone @ 1Gb/s on Summit 7i and 3Com switches
  – Latest purchases have single or dual 1Gb/s NICs
  – All batch workers connected @ 100Mb/s to 3Com fan-out switches with 1Gb/s uplink
  – Disk servers connected @ 1Gb/s to backbone switches
• Upgrades
  – All new hardware to have 1Gb/s NICs
  – Upgrade CPU rack network switches where necessary to 1Gb/s fan-out
  – New backbone switches:
    • Stackable units with 40Gb/s interlink and, where possible, a 10Gb/s upgrade path to the site router
• Joining UKLight network
  – 10Gb/s
  – Fewer hops to HEP sites
  – Multiple Gb/s links to Tier1/A
Software
• Transition to SL3
• Farms:
  – Scientific Linux 3 (Fermi)
    • Babar batch, prototype frontend
  – RedHat 7.n
    • 7.3: LCG batch, Tier1 batch, frontend systems
    • 7.2: Babar frontend systems
• Servers:
  – SL3
    • Systems services (mail, NIS, loggers, scheduler)
  – RedHat 7.2/7.3
    • Disk servers (custom kernels)
  – Fedora Core
    • Consoles, personal desktops
  – Solaris 2.6, 8, 9
    • SUN systems
  – AIX
    • AFS cell
Software Issues
• SL3
  – Easy to install with PXE/Kickstart
  – Migration of the Babar community from the RH 7.3 batch service was smooth once the installation had been validated by Babar for batch work
  – Batch system uses the Torque/Maui versions from LCG, rebuilt for SL3 with some local patches to config parameters (more jobs, more classes). Stable.
• RedHat 7.n
  – Security a big concern (!)
    • Speed of patching
    • Custom kernels a problem
• Enterprise (RHEL, SL)
  – Disk I/O (both read and write) performance not as good as can be achieved with RH 7.n (9). (SL, 2.4.21-15.0.n)
    • Need to test the more recent kernels
  – NFS, LVM and Megaraid controllers don’t mix!
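The PXE/Kickstart installs noted above can be illustrated with a minimal profile. Everything in this sketch – the mirror URL, partitioning, package selection and %post step – is hypothetical, not the actual RAL configuration:

```
# Minimal Kickstart sketch for an SL3 batch worker (illustrative only;
# server names, sizes and packages are assumptions, not the RAL profile)
install
url --url http://install.example.ac.uk/sl3/i386
lang en_GB
keyboard uk
clearpart --all --initlabel
part / --fstype ext3 --size 8192
part swap --size 2048
%packages
@ base
openafs-client
%post
# site-specific post-install steps (batch client config etc.) go here
```

Served over PXE, the same profile installs every worker identically, which is what makes a farm-wide OS transition like RH 7.3 to SL3 tractable.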
Projects
• Quattor
  – Ongoing preparation for implementation
• Infrastructure data challenge
  – Joining effort to test high-speed / high-availability / high-bandwidth data transfers to simulate LCG requirements
• RSS news service
• dCache
  – Disk pool manager combined with SRM
  – Software complex to configure
    • Multiple layers – difficult to drill down to find exactly why a problem has occurred; somewhat sensitive to hardware/system configurations
  – Working test deployment
    • 1 head node, 2 pool nodes
  – Next steps:
    • Create a multi-terabyte instance for CMS in LCG
Security
• Firewall at RAL is default Deny inbound
  – Keeps many but not all badguys™ out
  – Specific hosts have inbound Permit for specific ports
    • Sets of rules for LCG components (CE, SE, RB etc.) or services (AFS)
  – Outbound: generally open, port 80 via cache
  – X11 port was open, but not to Tier1/A (closed 1997!)
    • Now closed site-wide as of 8th Oct
• The badguys™ still get in…
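A default-Deny-inbound policy of this shape can be sketched in iptables terms. This is a configuration sketch, not the RAL rule set: the host names are placeholders, and the ports (2119 for a Globus gatekeeper on a CE, 2811 for GridFTP control on an SE) are illustrative examples of per-service Permits:

```
# Default Deny inbound; per-host, per-port Permits (illustrative sketch)
iptables -P INPUT DROP                            # default Deny inbound
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp -d ce.example.ac.uk --dport 2119 -j ACCEPT   # CE gatekeeper
iptables -A INPUT -p tcp -d se.example.ac.uk --dport 2811 -j ACCEPT   # SE GridFTP
iptables -P OUTPUT ACCEPT                         # outbound generally open
```

The per-service rule sets for CE, SE, RB and the like are just more ACCEPT lines of the same form, which is why they can be maintained as named groups.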
Recent Incident (1)
• Keyboard logger installed at remote site A exposes the password of an account at remote site B
• Access to exposed@siteB
  – Scans the account’s known_hosts for possible targets
• exposed@siteB has ssh keys unprotected by a pass-phrase
  – Unchallenged access to any account@host listed in known_hosts on which the unprotected public key is installed
  – !”£$%^&*#@;¬?>|
Recent Incident (2)
• Aug 26 at 23:05 BST, Badguy™ uses the unprotected key of the compromised account at remote site B to enter two systems at RAL, both RedHat 7.2 systems
• Downloads a custom IRC bot based on Energy Mech
  – Contains a klogd binary which is the IRC bot
• Possibly tries for privilege escalation
• Installs the IRC bot (klogd), attempting to usurp the system klogd or possibly other rogue klogds. Fails to kill the system klogd.
• Two klogds now running: the system one owned by root and the badguy™ version owned by the compromised user
• At some time later the directory containing the bot code (/tmp/.mc) is deleted
Recent Incident (3)
• Oct 7, am: we are told the system has been exhibiting suspicious activity, by legitimate remote IRC server admins who monitor for it. Systems removed from the network and forensic investigation begins.
• Dump of the bot/klogd process shows 4800+ hosts listed – it appears the system was part of an IRC network
  – Badguy™ bot/klogd listens on ports tcp:8181 and udp:34058
  – Contacts IRC servers at 4 addresses (port 6667), as "XzIbIt"
• Firewall logs show a relatively small amount of traffic from the affected host
• No trace of root exploits
• Second host was a user frontend system: no evidence of any IRC activity or root compromise
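The duplicate-klogd symptom above – one real daemon owned by root, one impostor owned by a user – suggests a crude check: flag any daemon name appearing more than once in the process list, with its owners. A minimal sketch; in real use the input would come from `ps -eo user,comm`, but a canned sample is piped in here for illustration:

```shell
#!/bin/sh
# Flag process names that occur more than once, listing their owners.
# Normally fed from `ps -eo user,comm --no-headers`; sample input below.
find_duplicate_daemons() {
    awk '{ count[$2]++; owners[$2] = owners[$2] " " $1 }
         END { for (name in count) if (count[name] > 1) print name ":" owners[name] }'
}

printf 'root klogd\nbaduser klogd\nroot syslogd\n' | find_duplicate_daemons
# → klogd: root baduser
```

A klogd owned by an ordinary user is exactly the kind of anomaly this would have surfaced; cross-checking listening sockets (netstat -tlnp) against expected services catches the tcp:8181 listener the same way.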
Lessons
• Unprotected ssh keys are bad news
  – If a key is unprotected on your system then all keys owned everywhere by that user are likely unprotected too
    • Use ssh-agent or similar
  – There are still .netrc files in use for production userids
• Communication
  – Lack of news from upstream sites a disappointment
    • If we had been told of the exploit at the remote site and the time frames involved we would have found the IRC bot within hours
• Protect infrastructure from user-accessible hosts
  – Firewalling
• Staff time: 2-3 staff weeks
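The ssh-key lesson above lends itself to a quick audit. A minimal sketch, assuming the PEM-format private keys of the era: an unencrypted key file lacks the "ENCRYPTED" header, so `grep -L` prints exactly the unprotected ones. The directory argument is an assumption; point it at whatever needs auditing:

```shell
#!/bin/sh
# List private keys under a directory that are NOT passphrase-protected.
# PEM-format keys without a passphrase have no "ENCRYPTED" header line,
# so `grep -L` (print files that do NOT match) names the unsafe ones.
KEYDIR="${1:-$HOME/.ssh}"
for key in "$KEYDIR"/id_*; do
    [ -f "$key" ] || continue
    case "$key" in *.pub) continue ;; esac   # skip public halves
    grep -L "ENCRYPTED" "$key"               # prints name if unprotected
done
# Remedy: add a passphrase with `ssh-keygen -p -f <key>`, then load it
# once per session:  eval "$(ssh-agent -s)" && ssh-add
```

With the passphrase held by ssh-agent, a stolen key file alone no longer grants the unchallenged access described in the incident.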