RAL Tier1/A Site Report
Transcript of RAL Tier1/A Site Report
Martin Bly
HEPiX - Brookhaven National Laboratory
18-20 October 2004
Overview
• Introduction
• Hardware
• Software
• Security
RAL Tier1/A
• RAL is the Tier 1 centre in the UK
  – Supports all VOs but gives priority to ATLAS, CMS, LHCb
  – LCG core site
• Babar collaboration Tier A
• Support for other experiments:
  – D0, H1, SNO, UKQCD, MINOS, Zeus, Theory, …
• Various test environments for grid projects
Pre-Grid Upgrade
[Chart: RAL Linux CSF weekly CPU utilisation, Financial Year 2000/01 – platform-related CPU hours, April 2000 to March 2001, with markers at 1 July 2000 and 1 October 2000]
Post-GRID Upgrade
GRID Load 21-28 July
Full again in 8 hours!
LCG in Production
• Since June the Tier1 LCG service has evolved into a full-scale production facility
  – Sort of sneaked up on us! Gradual change from a test/development environment to full-scale production.
  – Availability and reliability of the LCG service are now a high priority for RAL staff.
  – Now the largest single CPU resource at RAL
GRID Production
Hardware
• Main farms: 884 CPUs, approx 880 kSI2K
  – 312 CPUs x P3 @ 1.4GHz
  – 160 CPUs x P4/Xeon @ 2.66GHz, HT off
  – 512 CPUs x P4/Xeon @ 2.8GHz, HT off
• Disk: approx 226TB
  – 52 x 800GB R5 IDE/SCSI arrays
  – 22 x 2TB R5 IDE/SCSI arrays
  – 40 x 4TB R5 EonStor SATA/SCSI arrays
• Tape:
  – 6000-slot Powderhorn silo, 200GB/tape, 8 drives
• Misc:
  – SUN disk servers; AIX (AFS cell)
  – 140 CPUs x P3 @ 1GHz
Hardware Issues
• CPU and disks delivered June 16
• CPU units:
  – 6 in 256 failed under testing – memory, motherboard
  – Installed into production after ~4 weeks
• Disk systems:
  – Riser cards failing; looks to be the batch
  – Issues with EonStor firmware – fixes from vendor
  – Into production about now
Enhancements
• FY 2004/05 CPU/disk procurement starting shortly
  – Expect lower volume of CPU and disk
  – CPU technology: Xeon/Opteron
  – Disk technology: SATA/SCSI, SATA/FC, …
• Sun systems services and data migrating to SL3
  – Mail, NIS -> SL3
  – Data -> RH7.3, SL3
  – Due Xmas ’04
• AFS cell migration to SL3/OpenAFS
• Investigating SANs, iSCSI, SAS
Environment
• Farms dispersed over three machine rooms
• Extra temporary air-conditioning capacity for summer
  – Actually survived with it mostly idle!
• New air conditioning for lower machine room (A5L), independent of the main building air-con system. 5 units, 400kW; arrives November
• Extra power distribution (but not new power)
• All new rack kit to be located in A5L, shared with other high-availability services (HPC etc.)
• Issues:
  – New Nocona chips use more power – and create more heat
  – Rack weight on raised floors – latest kit is around 8 tonnes
  – Air-con unit weight + power
Network
• Site link – 2.5Gb/s to TVN
• Site backbone @ 1Gb/s
• Tier1/A backbone @ 1Gb/s on Summit 7i and 3Com switches
  – Latest purchases have single or dual 1Gb/s NICs
  – All batch workers connected @ 100Mb/s to 3Com fan-out switches with 1Gb/s uplink
  – Disk servers connected @ 1Gb/s to backbone switches
• Upgrades
  – All new hardware to have 1Gb/s NICs
  – Upgrade CPU rack network switches where necessary to 1Gb/s fan-out
  – New backbone switches:
    • Stackable units with 40Gb/s interlink and, where possible, a 10Gb/s upgrade path to the site router
• Joining UKLight network
  – 10Gb/s
  – Fewer hops to HEP sites
  – Multiple Gb/s links to Tier1/A
Software
• Transition to SL3
• Farms:
  – Scientific Linux 3 (Fermi)
    • Babar batch, prototype frontend
  – RedHat 7.n
    • 7.3: LCG batch, Tier1 batch, frontend systems
    • 7.2: Babar frontend systems
• Servers:
  – SL3
    • Systems services (mail, NIS, loggers, scheduler)
  – RedHat 7.2/7.3
    • Disk servers (custom kernels)
  – Fedora Core
    • Consoles, personal desktops
  – Solaris 2.6, 8, 9
    • SUN systems
  – AIX
    • AFS cell
Software Issues
• SL3
  – Easy to install with PXE/Kickstart
  – Migration of the Babar community from the RH 7.3 batch service was smooth once the installation had been validated by Babar for batch work
  – Batch system uses the Torque/Maui versions from LCG, rebuilt for SL3 with some local patches to config parameters (more jobs, more classes). Stable.
• RedHat 7.n
  – Security a big concern (!)
    • Speed of patching
    • Custom kernels a problem
• Enterprise (RHEL, SL)
  – Disk I/O (both read and write) performance not as good as can be achieved with RH 7.n (9). (SL, 2.4.21-15.0.n)
    • Need to test the more recent kernels
  – NFS, LVM and Megaraid controllers don’t mix!
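The PXE/Kickstart installs noted above can be illustrated with a minimal profile. Everything in this sketch – the mirror URL, partitioning, package selection and %post step – is hypothetical, not the actual RAL configuration:

```
# Minimal Kickstart sketch for an SL3 batch worker (illustrative only;
# server names, sizes and packages are assumptions, not the RAL profile)
install
url --url http://install.example.ac.uk/sl3/i386
lang en_GB
keyboard uk
clearpart --all --initlabel
part / --fstype ext3 --size 8192
part swap --size 2048
%packages
@ base
openafs-client
%post
# site-specific post-install steps (batch client config etc.) go here
```

Served over PXE, the same profile installs every worker identically, which is what makes a farm-wide OS transition like RH 7.3 to SL3 tractable.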
Projects
• Quattor
  – Ongoing preparation for implementation
• Infrastructure data challenge
  – Joining effort to test high-speed / high-availability / high-bandwidth data transfers to simulate LCG requirements
• RSS news service
• dCache
  – Disk pool manager combined with SRM
  – Software complex to configure
    • Multiple layers – difficult to drill down to find exactly why a problem has occurred; somewhat sensitive to hardware/system configurations
  – Working test deployment
    • 1 head node, 2 pool nodes
  – Next steps:
    • Create a multi-terabyte instance for CMS in LCG
Security
• Firewall at RAL is default Deny inbound
  – Keeps many but not all badguys™ out
  – Specific hosts have inbound Permit for specific ports
    • Sets of rules for LCG components (CE, SE, RB etc.) or services (AFS)
  – Outbound: generally open, port 80 via cache
  – X11 port was open, but not to Tier1/A (closed 1997!)
    • Now closed site-wide as of 8th Oct
• The badguys™ still get in…
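A default-Deny-inbound policy of this shape can be sketched in iptables terms. This is a configuration sketch, not the RAL rule set: the host names are placeholders, and the ports (2119 for a Globus gatekeeper on a CE, 2811 for GridFTP control on an SE) are illustrative examples of per-service Permits:

```
# Default Deny inbound; per-host, per-port Permits (illustrative sketch)
iptables -P INPUT DROP                            # default Deny inbound
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp -d ce.example.ac.uk --dport 2119 -j ACCEPT   # CE gatekeeper
iptables -A INPUT -p tcp -d se.example.ac.uk --dport 2811 -j ACCEPT   # SE GridFTP
iptables -P OUTPUT ACCEPT                         # outbound generally open
```

The per-service rule sets for CE, SE, RB and the like are just more ACCEPT lines of the same form, which is why they can be maintained as named groups.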
Recent Incident (1)
• Keyboard logger installed at remote site A exposes the password of an account at remote site B
• Access to exposed@siteB
  – Scans the account’s known_hosts for possible targets
• exposed@siteB has ssh keys unprotected by a pass-phrase
  – Unchallenged access to any account@host listed in known_hosts on which the unprotected public key is installed
  – !”£$%^&*#@;¬?>|
Recent Incident (2)
• Aug 26 at 23:05 BST, Badguy™ uses the unprotected key of the compromised account at remote site B to enter two systems at RAL, both RedHat 7.2 systems
• Downloads a custom IRC bot based on Energy Mech
  – Contains a klogd binary which is the IRC bot
• Possibly tries for privilege escalation
• Installs the IRC bot (klogd), attempting to usurp the system klogd or possibly other rogue klogds. Fails to kill the system klogd.
• Two klogds now running: the system one owned by root and the badguy™ version owned by the compromised user
• At some time later the directory containing the bot code (/tmp/.mc) is deleted
Recent Incident (3)
• Oct 7, am: we are told the system has been exhibiting suspicious activity, by legitimate remote IRC server admins who monitor for it. Systems removed from the network and forensic investigation begins.
• Dump of the bot/klogd process shows 4800+ hosts listed – it appears the system was part of an IRC network
  – Badguy™ bot/klogd listens on ports tcp:8181 and udp:34058
  – Contacts IRC servers at 4 addresses (port 6667), as "XzIbIt"
• Firewall logs show a relatively small amount of traffic from the affected host
• No trace of root exploits
• Second host was a user frontend system: no evidence of any IRC activity or root compromise
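The duplicate-klogd symptom above – one real daemon owned by root, one impostor owned by a user – suggests a crude check: flag any daemon name appearing more than once in the process list, with its owners. A minimal sketch; in real use the input would come from `ps -eo user,comm`, but a canned sample is piped in here for illustration:

```shell
#!/bin/sh
# Flag process names that occur more than once, listing their owners.
# Normally fed from `ps -eo user,comm --no-headers`; sample input below.
find_duplicate_daemons() {
    awk '{ count[$2]++; owners[$2] = owners[$2] " " $1 }
         END { for (name in count) if (count[name] > 1) print name ":" owners[name] }'
}

printf 'root klogd\nbaduser klogd\nroot syslogd\n' | find_duplicate_daemons
# → klogd: root baduser
```

A klogd owned by an ordinary user is exactly the kind of anomaly this would have surfaced; cross-checking listening sockets (netstat -tlnp) against expected services catches the tcp:8181 listener the same way.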
Lessons
• Unprotected ssh keys are bad news
  – If a key is unprotected on your system then all keys owned everywhere by that user are likely unprotected too
    • Use ssh-agent or similar
  – There are still .netrc files in use for production userids
• Communication
  – Lack of news from upstream sites a disappointment
    • If we had been told of the exploit at the remote site and the time frames involved we would have found the IRC bot within hours
• Protect infrastructure from user-accessible hosts
  – Firewalling
• Staff time: 2-3 staff weeks
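The ssh-key lesson above lends itself to a quick audit. A minimal sketch, assuming the PEM-format private keys of the era: an unencrypted key file lacks the "ENCRYPTED" header, so `grep -L` prints exactly the unprotected ones. The directory argument is an assumption; point it at whatever needs auditing:

```shell
#!/bin/sh
# List private keys under a directory that are NOT passphrase-protected.
# PEM-format keys without a passphrase have no "ENCRYPTED" header line,
# so `grep -L` (print files that do NOT match) names the unsafe ones.
KEYDIR="${1:-$HOME/.ssh}"
for key in "$KEYDIR"/id_*; do
    [ -f "$key" ] || continue
    case "$key" in *.pub) continue ;; esac   # skip public halves
    grep -L "ENCRYPTED" "$key"               # prints name if unprotected
done
# Remedy: add a passphrase with `ssh-keygen -p -f <key>`, then load it
# once per session:  eval "$(ssh-agent -s)" && ssh-add
```

With the passphrase held by ssh-agent, a stolen key file alone no longer grants the unchallenged access described in the incident.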