The SLAC Cluster Chuck Boeheim Assistant Director, SLAC Computing Services.
The SLAC Cluster
Chuck Boeheim
Assistant Director, SLAC Computing Services
Components
Solaris Farm: 900 single-CPU units
Linux Farm: 512 dual-CPU units
AFS: 7 servers, 3 TB
NFS: 21 servers, 16 TB
Objectivity: 94 servers, 52 TB
LSF: Master, backup, license
HPSS: Master + 10 tape movers
Interactive: 25 servers + E10000
Build Farm: 12 servers
Network: 9 Cisco 6509 switches
Staffing
System Admin 7
Mass Storage 3
Applications 3
Batch 1
Operations 4
Operators 0
• Same staff supports most Unix desktops on site
Growth in Systems
[Chart: Number of Systems by year, 1998 to 2001; vertical axis 0 to 2000]
Growth in Staffing
[Chart: Staff Size by year, 1998 to 2001; vertical axis 0 to 20]
Ratio of Systems/Staff
[Chart: Systems/Staff by year, 1998 to 2001; vertical axis 0 to 100]
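The chart values themselves are not recoverable from this transcript, but the Components and Staffing slides give enough figures to work out a rough snapshot of the ratio. The sketch below is only that: it assumes each farm unit counts as one system, counts LSF as 3 hosts, HPSS as 11, and Interactive as 26, and leaves the network switches out. None of those counting choices are stated in the slides.

```python
# Machine counts taken from the Components slide.
# Assumptions: LSF "master, backup, license" is counted as 3 hosts,
# HPSS as 1 master + 10 tape movers, Interactive as 25 servers + 1 E10000.
# The 9 network switches are not counted as systems.
systems = {
    "Solaris Farm": 900,
    "Linux Farm": 512,
    "AFS": 7,
    "NFS": 21,
    "Objectivity": 94,
    "LSF": 3,
    "HPSS": 11,
    "Interactive": 26,
    "Build Farm": 12,
}

# Head counts taken from the Staffing slide.
staff = {
    "System Admin": 7,
    "Mass Storage": 3,
    "Applications": 3,
    "Batch": 1,
    "Operations": 4,
    "Operators": 0,
}

total_systems = sum(systems.values())
total_staff = sum(staff.values())
ratio = total_systems / total_staff

# Prints: 1586 systems / 18 staff = 88 systems per staff member
print(f"{total_systems} systems / {total_staff} staff = {ratio:.0f} systems per staff member")
```

Under these assumptions the result falls inside the 0 to 100 range of the chart's vertical axis.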
Physical
• Racking, power, cooling, seismic, network
• Remote power management
• Remote console management
• Installation
    • Burn-in, DOAs
• Maintenance
    • Replacement burn-in
    • Divergence from original models
• Locating a machine
Networking
• Gb to servers
• 100Mb to farm nodes
• Speed matching (problems) at switches
• Network glitches and storms
• Network monitoring
System Admin
• Network install (256 machines in < 1 hr)
• Patch management
• Power Up/Down
• Nightly maintenance
• System Ranger (monitor)
• Report summarization
“A Cluster is a large Error Amplifier”
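To illustrate why report summarization matters when a cluster amplifies errors, here is a minimal sketch that collapses per-host messages into one line per distinct problem. It is not how System Ranger or any SLAC tool worked. The `summarize` function, the host names, and the messages are all hypothetical.

```python
import re
from collections import defaultdict

def summarize(reports):
    """Group (host, message) pairs by message so each distinct problem appears once."""
    hosts_by_message = defaultdict(set)
    for host, message in reports:
        # Replace digit runs so messages that differ only in a number collapse together.
        key = re.sub(r"\d+", "N", message)
        hosts_by_message[key].add(host)
    # Most widespread problems first; ties broken alphabetically for stable output.
    ranked = sorted(hosts_by_message.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(message, sorted(hosts)) for message, hosts in ranked]

# Hypothetical nightly reports; host names and messages are invented for illustration.
reports = [
    ("node001", "/tmp is 97% full"),
    ("node002", "/tmp is 99% full"),
    ("node003", "ntpd not running"),
    ("node004", "/tmp is 98% full"),
]

for message, hosts in summarize(reports):
    print(f"{len(hosts):4d} hosts: {message}  (e.g. {hosts[0]})")
```

With hundreds of nodes, one misconfiguration would otherwise produce hundreds of near-identical lines. Grouping by a normalized message keeps the report to one line per problem, with a host count.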
User Application Issues
• Workload scheduling
• Startup effects
• Distribution vs Hot Spots
• System and Network Limits
    • File descriptors
    • Memory
    • Cache contention
    • NIS, DNS, AMD
• Job Scheduling
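As one concrete instance of the limits listed above, a job can check its per-process file descriptor limit before opening many files or sockets. This is a minimal sketch using Python's standard `resource` module on Unix. The `fd_headroom` helper and the figure of 200 descriptors are assumptions for illustration, not values from the talk.

```python
import resource

def fd_headroom(needed):
    """Return (soft, hard, ok) for the per-process open-file limit on Unix."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means "no limit"; treat it as always sufficient.
    ok = soft == resource.RLIM_INFINITY or soft >= needed
    return soft, hard, ok

# Assumption: the job expects to hold about 200 files or sockets open at once.
soft, hard, ok = fd_headroom(needed=200)
print(f"soft={soft} hard={hard} enough_for_job={ok}")
```

Running the check at job startup gives a clear failure up front, before the limit is hit partway through a long batch job.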
Test Beds