Liverpool HEP - Site Report June 2008 Robert Fay, John Bland.

Liverpool HEP - Site Report

June 2008

Robert Fay, John Bland

Staff Status

One member of staff left in the past year:

• Paul Trepka, left March 2008

Two full time HEP system administrators

• John Bland, Robert Fay

One full time Grid administrator currently being hired

• Closing date for applications was Friday 13th, 15 applications received

One part time hardware technician

• Dave Muskett

Current Hardware

Desktops

• ~100 Desktops: Scientific Linux 4.3, Windows XP

• Minimum spec of 2GHz x86, 1GB RAM + TFT Monitor

Laptops

• ~60 Laptops: Mixed architectures, specs and OSes.

Batch Farm

• Software repository (0.7TB), storage (1.3TB)

• Old ‘batch’ queue has 10 SL3 dual 800MHz P3s with 1GB RAM

• ‘medium’, ‘short’ queues consist of 40 SL4 MAP-2 nodes (3GHz P4s)

• 5 interactive nodes (dual Xeon 2.4GHz)

• Using Torque/PBS

• Used for general analysis jobs

Current Hardware – continued

Matrix

• 1 dual 2.40GHz Xeon, 1GB RAM

• 6TB RAID array

• Used for CDF batch analysis and data storage

HEP Servers

• 4 core servers

• User file store + bulk storage via NFS (Samba front end for Windows)

• Web (Apache), email (Sendmail) and database (MySQL)

• User authentication via NIS (+Samba for Windows)

• Dual Xeon 2.40GHz shell server and ssh server

• Core servers have a failover spare

Current Hardware – continued

LCG Servers

• CE, SE upgraded to new hardware:

• CE now 8-core Xeon 2 GHz, 8GB RAM

• SE now 4-core Xeon 2.33GHz, 8GB RAM, RAID 10 array

• CE, SE, UI all SL4, gLite 3.1

• Mon still SL3, gLite 3.0

• BDII SL4, gLite 3.0

Current Hardware – continued

MAP2 Cluster

• 24-rack (960-node) Dell PowerEdge 650 cluster

• 4 racks (280 nodes) shared with other departments

• Each node has 3GHz P4, 1GB RAM, 120GB local storage

• 19 racks (680 nodes) primarily for LCG jobs (5 racks currently allocated for local ATLAS/T2K/Cockcroft batch processing)

• 1 rack (40 nodes) for general purpose local batch processing

• Front end machines for ATLAS, T2K, Cockcroft

• Each rack has two 24 port gigabit switches

• All racks connected into VLANs via Force10 managed switch

Storage

RAID

• All file stores are using at least RAID5. Newer servers using RAID6.

• All RAID arrays using 3ware 7xxx/9xxx controllers on Scientific Linux 4.3.

• Arrays monitored with 3ware 3DM2 software.
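As a rough illustration of the capacity trade-off between the RAID levels mentioned above, here is a minimal sketch using the standard rule that RAID5 gives up one drive's worth of space to parity and RAID6 gives up two, with each hot spare costing one more drive. The function name, drive counts and drive sizes are hypothetical; the actual drive configuration of the Liverpool arrays is not stated in this report.

```python
# Drives' worth of capacity given up to parity at each RAID level.
PARITY_DRIVES = {"RAID5": 1, "RAID6": 2}


def usable_capacity_tb(drives, drive_tb, level, hot_spares=0):
    """Return usable capacity in TB for a single array.

    drives     -- total physical drives, including any hot spares
    drive_tb   -- capacity of one drive in TB
    level      -- "RAID5" or "RAID6"
    hot_spares -- drives held idle as hot spares
    """
    parity = PARITY_DRIVES[level]
    data_drives = drives - hot_spares - parity
    if data_drives < 2:
        raise ValueError("not enough drives for " + level)
    return data_drives * drive_tb


if __name__ == "__main__":
    # Hypothetical example only: six 0.75TB drives.
    print(usable_capacity_tb(6, 0.75, "RAID5"))                # one parity drive
    print(usable_capacity_tb(6, 0.75, "RAID6", hot_spares=1))  # two parity + one spare
```

With the same six hypothetical drives, RAID6 plus a hot spare yields noticeably less space than RAID5, which is the price paid for surviving a second drive failure during a rebuild.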

File stores

• New User and critical software store, RAID6+HS, 2.25TB

• ~10TB general purpose ‘hepstores’ for bulk storage

• 1.4TB + 0.7TB batchstore+batchsoft for the Batch farm cluster

• 1.4TB hepdata for backups

• 37TB RAID6 for LCG storage element

Storage (continued)

3ware Problems!

• 3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.

• 3w-9xxx: scsi0: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence.

• 3w-9xxx: scsi0: AEN: ERROR: (0x04:0x005F): Cache synchronization failed; some data lost:unit=0.

• Leads to total loss of data access until system is rebooted.

• Sometimes leads to data corruption at array level.

• Seen under iozone load, under normal production load, and following drive failure.

• Anyone else seen this?
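For anyone wanting to check their own logs for the same symptoms, the following is a minimal sketch of a parser for driver messages in the format quoted above. The regular expression is derived only from the three sample lines shown here, so it is an assumption that other 3w-9xxx messages follow the same layout; the function and field names are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the three sample messages above (an assumption,
# not a documented format). Any prefix such as a syslog timestamp is ignored.
LINE_RE = re.compile(
    r"3w-9xxx: (?P<host>scsi\d+): (?P<aen>AEN: )?"
    r"(?P<severity>WARNING|ERROR): "
    r"\((?P<code>0x[0-9A-Fa-f]+:0x[0-9A-Fa-f]+)\): (?P<message>.*)"
)


def parse_3ware_line(line):
    """Return a dict of fields for a 3w-9xxx message, or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return {
        "host": match.group("host"),
        "aen": match.group("aen") is not None,
        "severity": match.group("severity"),
        "code": match.group("code"),
        "message": match.group("message").strip(),
    }


def count_by_severity(lines):
    """Count matching messages per severity, skipping unrelated lines."""
    parsed = (parse_3ware_line(line) for line in lines)
    return Counter(entry["severity"] for entry in parsed if entry)


SAMPLE_LINES = [
    "3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.",
    "3w-9xxx: scsi0: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence.",
    "3w-9xxx: scsi0: AEN: ERROR: (0x04:0x005F): Cache synchronization failed; some data lost:unit=0.",
]

if __name__ == "__main__":
    for line in SAMPLE_LINES:
        print(parse_3ware_line(line))
    print(count_by_severity(SAMPLE_LINES))
```

Feeding the output of a log file line by line into count_by_severity gives a quick indication of whether a host has hit the reset sequence described above.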

Network

Topology

[Diagram. Labels: Force10 Gigabit Switch, WAN, firewall, LCG servers, MAP2, Offices, Servers. Link labels: 2GB, 2GB, VLAN, 1GB link.]

Network (continued)

Core Force10 E600 managed switch.

• Now have 450 gigabit ports (240 at line rate)

• Used as central departmental switch, using VLANs

• Increased bandwidth to WAN using link aggregation to 2-3Gbit/s

• Increased departmental backbone to 2Gbit/s

• Added departmental firewall/gateway

• Network intrusion monitoring with snort

• Most office PCs and laptops are on internal private network

• Building network infrastructure is creaking

- needs rewiring; old cheap hubs and switches need replacing

Security & Monitoring

Security

• Logwatch (looking to develop filters to reduce ‘noise’)

• University firewall + local firewall + network monitoring (snort)

• Secure server room with swipe card access

Monitoring

• Core network traffic usage monitored with ntop and cacti (all traffic to be monitored after network upgrade)

• Use sysstat on core servers for recording system statistics

• Rolling out system monitoring on all servers and worker nodes, using SNMP, Ganglia, Cacti, and Nagios

• Hardware temperature monitors on water cooled racks, to be supplemented by software monitoring on nodes via SNMP. Still investigating other environment monitoring solutions.

System Management

• Puppet used for configuration management

• dotProject used for general helpdesk

• RT integrated with Nagios for system management

- Nagios automatically creates/updates tickets on acknowledgement

- Each RT ticket serves as a record for an individual system
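The one-ticket-per-system workflow can be sketched as follows. This is an in-memory illustration only: it does not call Nagios or RT, and the class names, method names and host name are hypothetical stand-ins rather than the scripts actually used at Liverpool.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    """Stand-in for an RT ticket acting as the record for one system."""
    host: str
    history: list = field(default_factory=list)


class TicketStore:
    """In-memory stand-in for RT, holding at most one ticket per host."""

    def __init__(self):
        self._tickets = {}

    def record_acknowledgement(self, host, service, comment):
        """Create the host's ticket if needed, then append the note.

        Returns (ticket, created), where created is True only when the
        ticket did not exist before this acknowledgement.
        """
        created = host not in self._tickets
        if created:
            self._tickets[host] = Ticket(host)
        ticket = self._tickets[host]
        ticket.history.append(f"{service}: {comment}")
        return ticket, created


if __name__ == "__main__":
    store = TicketStore()
    # Hypothetical host and services, for illustration only.
    print(store.record_acknowledgement("node001", "disk", "drive replaced"))
    print(store.record_acknowledgement("node001", "load", "runaway job killed"))
```

Keying tickets on the host rather than on the individual alert is what lets a ticket accumulate into a maintenance history for that machine.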

Plans

Additional storage for the Grid

• GridPP3 funded

• Will be approx. 60? TB

• May switch from dCache to DPM

Upgrades to local batch farm

• Plans to purchase several multi-core (most likely 8-core) nodes

Collaboration with local Computing Services Department

• Share of their newly commissioned multi-core cluster available
