Liverpool HEP - Site Report June 2008 Robert Fay, John Bland.
Liverpool HEP - Site Report
June 2008
Robert Fay, John Bland
Staff Status
One member of staff left in the past year:
• Paul Trepka, left March 2008
Two full time HEP system administrators
• John Bland, Robert Fay
One full time Grid administrator currently being hired
• Closing date for applications was Friday 13th; 15 applications received
One part time hardware technician
• Dave Muskett
Current Hardware
Desktops
• ~100 Desktops: Scientific Linux 4.3, Windows XP
• Minimum spec of 2GHz x86, 1GB RAM + TFT Monitor
Laptops
• ~60 Laptops: Mixed architectures, specs and OSes.
Batch Farm
• Software repository (0.7TB), storage (1.3TB)
• Old ‘batch’ queue has 10 SL3 dual 800MHz P3s with 1GB RAM
• ‘medium’, ‘short’ queues consist of 40 SL4 MAP-2 nodes (3GHz P4s)
• 5 interactive nodes (dual Xeon 2.4GHz)
• Using Torque/PBS
• Used for general analysis jobs
Current Hardware – continued
Matrix
• 1 dual 2.40GHz Xeon, 1GB RAM
• 6TB RAID array
• Used for CDF batch analysis and data storage
HEP Servers
• 4 core servers
• User file store + bulk storage via NFS (Samba front end for Windows)
• Web (Apache), email (Sendmail) and database (MySQL)
• User authentication via NIS (+Samba for Windows)
• Dual Xeon 2.40GHz shell server and ssh server
• Core servers have a failover spare
Current Hardware – continued
LCG Servers
• CE, SE upgraded to new hardware:
• CE now 8-core Xeon 2 GHz, 8GB RAM
• SE now 4-core Xeon 2.33GHz, 8GB RAM, Raid 10 array
• CE, SE, UI all SL4, gLite 3.1
• Mon still SL3, gLite 3.0
• BDII SL4, gLite 3.0
Current Hardware – continued
MAP2 Cluster
• 24-rack (960-node) Dell PowerEdge 650 cluster
• 4 racks (160 nodes) shared with other departments
• Each node has 3GHz P4, 1GB RAM, 120GB local storage
• 19 racks (760 nodes) primarily for LCG jobs (5 racks currently allocated for local ATLAS/T2K/Cockcroft batch processing)
• 1 rack (40 nodes) for general purpose local batch processing
• Front end machines for ATLAS, T2K, Cockcroft
• Each rack has two 24 port gigabit switches
• All racks connected into VLANs via Force10 managed switch
Storage
RAID
• All file stores are using at least RAID5. Newer servers using RAID6.
• All RAID arrays using 3ware 7xxx/9xxx controllers on Scientific Linux 4.3.
• Arrays monitored with 3ware 3DM2 software.
File stores
• New User and critical software store, RAID6+HS, 2.25TB
• ~10TB general purpose ‘hepstores’ for bulk storage
• 1.4TB + 0.7TB batchstore+batchsoft for the Batch farm cluster
• 1.4TB hepdata for backups
• 37TB RAID6 for LCG storage element
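The RAID levels mentioned in this report trade raw capacity for redundancy in different ways. A rough sketch of the standard usable-capacity arithmetic follows, assuming identical drives. The function name, drive counts and drive sizes are made up for illustration and do not describe the Liverpool arrays.

```python
def usable_tb(level, drives, drive_tb, hot_spares=0):
    """Usable capacity in TB for an array of identical drives.

    RAID5 gives up one drive's worth of space to parity, RAID6 two,
    and RAID10 half of the drives to mirroring. Hot spares hold no
    data until a rebuild, so they are excluded first.
    """
    active = drives - hot_spares
    if level == "RAID5":
        data_drives = active - 1
    elif level == "RAID6":
        data_drives = active - 2
    elif level == "RAID10":
        data_drives = active // 2
    else:
        raise ValueError(f"unsupported RAID level: {level}")
    if data_drives < 1:
        raise ValueError("not enough drives for this RAID level")
    return data_drives * drive_tb


# Hypothetical example: eight 0.5TB drives under each level.
for level in ("RAID5", "RAID6", "RAID10"):
    print(level, usable_tb(level, drives=8, drive_tb=0.5))
```

With a hot spare set aside (hot_spares=1), one further drive is removed from the count before the parity or mirroring overhead is applied.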
Storage (continued)
3ware Problems!
• 3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.
• 3w-9xxx: scsi0: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence.
• 3w-9xxx: scsi0: AEN: ERROR: (0x04:0x005F): Cache synchronization failed; some data lost:unit=0.
• Leads to total loss of data access until system is rebooted.
• Sometimes leads to data corruption at array level.
• Seen under iozone load, under normal production load, and following drive failures.
• Anyone else seen this?
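The kernel messages quoted above share a regular shape, so they can be picked out of a log mechanically. A minimal Python sketch, assuming the messages appear in a plain-text log in exactly the format shown; the function name and the sample input are illustrative and not part of the Liverpool setup.

```python
import re

# Matches the 3w-9xxx driver lines quoted above, for example
# "3w-9xxx: scsi0: ERROR: (0x06:0x001F): Microcontroller not ready ...".
# The optional "AEN: " prefix appears on the third quoted message.
PATTERN = re.compile(
    r"3w-9xxx: (?P<host>scsi\d+): (?:AEN: )?(?P<level>WARNING|ERROR): "
    r"\((?P<code>0x[0-9A-Fa-f]+:0x[0-9A-Fa-f]+)\): (?P<text>.*)"
)


def find_3ware_events(lines):
    """Return a (host, level, code, text) tuple for each 3w-9xxx line."""
    events = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            events.append(
                (m["host"], m["level"], m["code"], m["text"].strip())
            )
    return events


sample = [
    "3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.",
    "3w-9xxx: scsi0: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence.",
    "3w-9xxx: scsi0: AEN: ERROR: (0x04:0x005F): Cache synchronization failed; some data lost:unit=0.",
    "kernel: unrelated message",  # ignored by the pattern
]

for event in find_3ware_events(sample):
    print(event)
```

Feeding the output into an alerting system would give earlier warning of a controller reset than waiting for data access to fail.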
Network
Topology
[Topology diagram. Labels: Force10 Gigabit Switch, WAN, firewall, LCG servers, MAP2, Offices, Servers. Link labels: 2GB, 2GB. Legend: VLAN, 1GB link.]
Network (continued)
Core Force10 E600 managed switch.
• Now have 450 gigabit ports (240 at line rate)
• Used as central departmental switch, using VLANs
• Increased bandwidth to WAN using link aggregation to 2-3Gbit/s
• Increased departmental backbone to 2Gbit/s
• Added departmental firewall/gateway
• Network intrusion monitoring with snort
• Most office PCs and laptops are on internal private network
• Building network infrastructure is creaking
- needs rewiring; old cheap hubs and switches need replacing
Security & Monitoring
Security
• Logwatch (looking to develop filters to reduce ‘noise’)
• University firewall + local firewall + network monitoring (snort)
• Secure server room with swipe card access
Monitoring
• Core network traffic usage monitored with ntop and cacti (all traffic to be monitored after network upgrade)
• Use sysstat on core servers for recording system statistics
• Rolling out system monitoring on all servers and worker nodes, using SNMP, Ganglia, Cacti, and Nagios
• Hardware temperature monitors on water cooled racks, to be supplemented by software monitoring on nodes via SNMP. Still investigating other environment monitoring solutions.
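As an illustration of how a temperature reading could feed into Nagios, the sketch below maps a value onto the standard Nagios plugin states (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and builds the usual one-line status with performance data. The thresholds, the function name and the way the reading is obtained are assumptions; the report does not say how the SNMP values will be collected or what limits apply.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_temperature(celsius, warn=30.0, crit=35.0):
    """Return (state, status_line) for a temperature reading.

    `celsius` is None when no reading could be obtained.
    `warn` and `crit` are hypothetical thresholds in degrees Celsius.
    """
    if celsius is None:
        return UNKNOWN, "TEMP UNKNOWN - no reading"
    if celsius >= crit:
        state = CRITICAL
    elif celsius >= warn:
        state = WARNING
    else:
        state = OK
    # Text after the "|" is Nagios performance data: label=value;warn;crit
    line = (
        f"TEMP {STATE_NAMES[state]} - {celsius:.1f}C"
        f" | temp={celsius:.1f};{warn};{crit}"
    )
    return state, line


for reading in (28.0, 31.5, 36.0, None):
    print(check_temperature(reading))
```

A real plugin would print only the status line and exit with the state code so that Nagios can pick up both.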
System Management
• Puppet used for configuration management
• Dotproject used for general helpdesk
• RT integrated with Nagios for system management
- Nagios automatically creates/updates tickets on acknowledgement
- Each RT ticket serves as a record for an individual system
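The report does not say how the Nagios and RT integration is implemented. One possible shape, sketched here purely as an assumption, is to send acknowledgements through RT's mail gateway, where a subject tag of the form "[rtname #id]" attaches the mail to an existing ticket and an untagged mail opens a new one. All addresses, the RT instance name and the function name are hypothetical.

```python
from email.message import EmailMessage


def build_ack_mail(host, service, author, comment, ticket_id=None,
                   rtname="example.org",
                   queue_addr="systems@rt.example.org",
                   sender="nagios@example.org"):
    """Build the mail a Nagios acknowledgement could send to RT.

    With `ticket_id` set, the subject carries the RT tag so the mail
    updates that system's existing ticket; without it, RT would create
    a new ticket in the queue behind `queue_addr`.
    """
    subject = f"{host}: {service} acknowledged"
    if ticket_id is not None:
        subject = f"[{rtname} #{ticket_id}] {subject}"
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = queue_addr
    msg["Subject"] = subject
    msg.set_content(f"Acknowledged by {author}: {comment}")
    return msg


# Hypothetical acknowledgement for a system that already has ticket 42.
print(build_ack_mail("node01", "PING", "admin", "investigating", ticket_id=42))
```

Sending the message (for example with smtplib) and looking up the ticket number for a given host are left out, since both depend on local configuration.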
Plans
Additional storage for the Grid
• GridPP3 funded
• Will be approx. 60? TB
• May switch from dCache to DPM
Upgrades to local batch farm
• Plans to purchase several multi-core (most likely 8-core) nodes
Collaboration with local Computing Services Department
• Share of their newly commissioned multi-core cluster available