The SLAC Cluster Chuck Boeheim Assistant Director, SLAC Computing Services.


Page 1

The SLAC Cluster

Chuck Boeheim

Assistant Director, SLAC Computing Services

Page 2

Components

Solaris Farm    900 single CPU units
Linux Farm      512 dual CPU units
AFS             7 servers, 3 TB
NFS             21 servers, 16 TB
Objectivity     94 servers, 52 TB
LSF             Master, backup, license
HPSS            Master + 10 tape movers
Interactive     25 servers, + E10000
Build Farm      12 servers
Network         9 Cisco 6509 switches

Page 3

Staffing

System Admin 7

Mass Storage 3

Applications 3

Batch 1

Operations 4

Operators 0

• Same staff supports most Unix desktops on site

Page 4

Growth in Systems

[Chart: Number of Systems (vertical axis, 0 to 2000) by year, 1998 to 2001]

Page 5

Growth in Staffing

[Chart: Staff Size (vertical axis, 0 to 20) by year, 1998 to 2001]

Page 6

Ratio of Systems/Staff

[Chart: Systems/Staff (vertical axis, 0 to 100) by year, 1998 to 2001]
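The chart data did not survive in this transcript, but a rough systems-to-staff ratio can be worked out from the Components and Staffing slides. The sketch below is only an illustration: how each component line maps to a machine count is an assumption (LSF is counted as three machines, the E10000 as one, and the network switches are left out), so the result is indicative rather than the figure plotted on the original slide.

```python
# Machine counts read off the Components slide.
# Assumptions: LSF "master, backup, license" = 3 machines,
# HPSS = 1 master + 10 tape movers, Interactive = 25 servers + 1 E10000,
# and the 9 network switches are not counted as systems.
systems = {
    "Solaris Farm": 900,
    "Linux Farm": 512,
    "AFS": 7,
    "NFS": 21,
    "Objectivity": 94,
    "LSF": 3,
    "HPSS": 11,
    "Interactive": 26,
    "Build Farm": 12,
}

# Head counts from the Staffing slide.
staff = {
    "System Admin": 7,
    "Mass Storage": 3,
    "Applications": 3,
    "Batch": 1,
    "Operations": 4,
    "Operators": 0,
}

total_systems = sum(systems.values())
total_staff = sum(staff.values())
ratio = total_systems / total_staff

print(f"{total_systems} systems / {total_staff} staff = {ratio:.1f} systems per person")
```

Under these assumptions the totals fall inside the axis ranges of the three charts (up to 2000 systems, up to 20 staff, up to 100 systems per staff member).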

Page 7

Physical

• Racking, power, cooling, seismic, network
• Remote power management
• Remote console management
• Installation
    • Burn-in, DOAs
• Maintenance
    • Replacement burn-in
    • Divergence from original models

Locating a machine

Page 8

Networking

• Gb to servers
• 100 Mb to farm nodes
• Speed matching (problems) at switches
• Network glitches and storms
• Network monitoring

Page 9

System Admin

• Network install (256 machines in < 1 hr)
• Patch management
• Power Up/Down
• Nightly maintenance
• System Ranger (monitor)
• Report summarization

"A Cluster is a large Error Amplifier"
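The "error amplifier" remark is why report summarization matters: one fault repeated on hundreds of nodes produces hundreds of near-identical messages. The sketch below shows one possible way to collapse them; it is not the tool used at SLAC. The function name `summarize_reports`, the host names, and the sample messages are all hypothetical.

```python
import re
from collections import defaultdict


def summarize_reports(reports):
    """Group near-identical messages and list the hosts that sent each one.

    `reports` is an iterable of (host, message) pairs. Runs of digits are
    masked so that messages differing only in a PID or counter collapse
    into a single entry.
    """
    hosts_by_message = defaultdict(set)
    for host, message in reports:
        key = re.sub(r"\d+", "N", message.strip())
        hosts_by_message[key].add(host)
    # Most widespread message first, ties broken alphabetically.
    return sorted(
        ((msg, sorted(hosts)) for msg, hosts in hosts_by_message.items()),
        key=lambda item: (-len(item[1]), item[0]),
    )


# Hypothetical sample input: three nodes report the same failure.
sample = [
    ("farm001", "cron: job 4123 failed"),
    ("farm002", "cron: job 977 failed"),
    ("farm003", "cron: job 15 failed"),
    ("farm002", "disk /tmp 91% full"),
]

for message, hosts in summarize_reports(sample):
    print(f"{len(hosts):4d} hosts: {message}")
```

Masking digits is a deliberately crude normalization; a real deployment would likely need site-specific rules for what counts as "the same" message.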

Page 10

User Application Issues

• Workload scheduling
• Startup effects
• Distribution vs Hot Spots
• System and Network Limits
    • File descriptors
    • Memory
    • Cache contention
    • NIS, DNS, AMD
• Job Scheduling

Test Beds
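Of the system limits listed, the file descriptor limit is the easiest to check from inside a job. The following is a minimal sketch, assuming a Unix host where Python's standard `resource` module is available; the helper name `fd_limit_report` is hypothetical and not part of any SLAC tooling.

```python
import resource


def fd_limit_report(files_needed):
    """Compare a job's expected open-file count with the per-process limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means the limit is not enforced.
    fits = soft == resource.RLIM_INFINITY or files_needed <= soft
    return {"soft": soft, "hard": hard, "needed": files_needed, "fits": fits}


report = fd_limit_report(files_needed=1)
print(report)
```

A job that expects to open many database or data files at once could run a check like this at startup and fail early with a clear message instead of partway through.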