The RHIC-ATLAS Computing Facility at BNL
HEPiX – Edinburgh, May 24-28, 2004
Tony Chan
RHIC Computing Facility
Brookhaven National Laboratory
Outline
Background
Mass Storage
Central Disk Storage
Linux Farm
Monitoring
Security & Authentication
Future Developments
Summary
Background
Brookhaven National Lab (BNL) is a U.S. gov’t funded multi-disciplinary research laboratory.
RACF formed in the mid-90’s to address computing needs of RHIC experiments. Became U.S. Tier 1 Center for ATLAS in late 90’s.
RACF supports HENP and HEP scientific computing efforts and various general services (backup, e-mail, web, off-site data transfer, Grid, etc).
Background (continued)
Currently 29 staff members (4 new hires in 2004).
RHIC Year 4 just concluded. Performance surpassed all expectations.
Staff Growth at the RACF
[Chart: RACF staff levels (FTE) by year, 1997-2004.]
RACF Structure
Mass Storage
4 StorageTek tape silos managed via HPSS (v 4.5).
Using 37 9940B drives (200 GB/tape).
Aggregate bandwidth up to 700 MB/s.
10 data movers with 10 TB of disk.
Total over 1.5 PB of raw data in 4 years of running (capacity for 4.5 PB).
The Mass Storage System
Central Disk Storage
Large SAN served via NFS: DSTs, user home directories, and scratch space.
41 Sun servers (E450 & V480) running Solaris 8 and 9. Plan to migrate all to Solaris 9 eventually.
24 Brocade switches & 250 TB of FB RAID5 managed by Veritas.
Aggregate 600 MB/s data rate to/from Sun servers on average.
Central Disk Storage (cont.)
RHIC and ATLAS AFS cells: software repositories and user home directories.
Total of 11 AIX servers with 1.2 TB for RHIC and 0.5 TB for ATLAS.
Transarc on server side, OpenAFS on client side.
Considering OpenAFS for server side.
The Central Disk Storage System
Linux Farm
Used for mass processing of data.
1359 rack-mounted, dual-CPU (Intel) servers.
Total of 1362 kSpecInt2000.
Reliable (about 6 hardware failures per month at current farm size).
Combination of SCSI & IDE disks with aggregate of 234+ TB of local storage.
Linux Farm (cont.)
Experiments making significant use of local storage through custom job schedulers, data repository managers and rootd.
Requires significant infrastructure resources (network, power, cooling, etc).
Significant scalability challenges. Advance planning and careful design a must!
The Growth of the Linux Farm
[Chart: Linux Farm capacity in kSpecInt2000 by year, 1999-2004.]
The Linux Farm in the RACF
Linux Farm Software
Custom Red Hat 8 (RHIC) and Red Hat 7.3 (ATLAS) images.
Installed via a Kickstart server.
Support for compilers (gcc, PGI, Intel) and debuggers (gdb, Totalview, Intel).
Support for network file systems (AFS, NFS) and local data storage.
Linux Farm Batch Management
New Condor-based batch system with a custom Python front-end to replace the old batch system. Fully deployed in the Linux Farm.
Use of Condor DAGMan functionality to handle job dependencies (see the sketch after this list).
New system solves scalability problems of old system.
Upgraded to Condor 6.6.5 (latest stable release) to implement advanced features (queue priority and preemption).
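As an illustration of the kind of Python front-end and DAGMan usage described above, the following minimal sketch generates Condor submit files and a DAG for a two-step job dependency; the executables, file names, and run identifier are hypothetical and this is not the actual RACF front-end.

#!/usr/bin/env python
# Minimal sketch: a Python front-end that writes Condor submit description
# files and a DAGMan DAG for a two-step dependency (reconstruction before
# analysis), then hands the DAG to condor_submit_dag.
# All executables, file names and the run identifier are hypothetical.
import subprocess

SUBMIT_TEMPLATE = """universe   = vanilla
executable = {exe}
arguments  = {args}
output     = {name}.out
error      = {name}.err
log        = {name}.log
queue
"""

def write_submit(name, exe, args):
    """Write a Condor submit description file and return its path."""
    path = name + ".sub"
    with open(path, "w") as f:
        f.write(SUBMIT_TEMPLATE.format(exe=exe, args=args, name=name))
    return path

def submit_chain(run):
    reco = write_submit("reco_" + run, "/usr/local/bin/reco.sh", run + ".raw")
    ana = write_submit("ana_" + run, "/usr/local/bin/ana.sh", run + ".dst")

    # DAGMan enforces the dependency: ANA starts only after RECO succeeds.
    dag = run + ".dag"
    with open(dag, "w") as f:
        f.write("JOB RECO " + reco + "\n")
        f.write("JOB ANA " + ana + "\n")
        f.write("PARENT RECO CHILD ANA\n")

    subprocess.call(["condor_submit_dag", dag])

if __name__ == "__main__":
    submit_chain("run1234")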
Linux Farm Batch Management (cont.)
LSF v5.1 widely used in the Linux Farm, especially for data analysis jobs. Peak rate of 350 K jobs/week.
LSF possibly to be replaced by Condor if the latter can scale to similar peak job rates. Current Condor peak rate is 7 K jobs/week.
Condor and LSF accept jobs through Globus (see the sketch after this list).
Condor scalability to be tested in ATLAS DC 2.
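For illustration, a minimal sketch of pushing a grid job to the farm through Globus GRAM (Globus Toolkit 2 style); the gatekeeper hostname is a hypothetical placeholder, and "jobmanager-condor" selects the Condor-backed job manager.

# Minimal sketch: submit a trivial grid job through a Globus gatekeeper to
# the Condor-backed job manager.  The hostname is a hypothetical placeholder.
import subprocess

GATEKEEPER = "gatekeeper.example.gov"   # hypothetical gatekeeper host

subprocess.call(["globus-job-run",
                 GATEKEEPER + "/jobmanager-condor",
                 "/bin/hostname"])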
Condor Usage at the RACF
Monitoring
Mix of open-source, RCF-designed and vendor-provided monitoring software (a sketch of the polling pattern follows this list).
Persistency and fault-tolerant features.
Near real-time information.
Scalability requirements.
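As a sketch of the polling pattern behind such custom monitors, the fragment below collects load averages from a set of farm nodes, tolerates unreachable nodes, and appends results to a persistent log; the node names, metric, and log file are hypothetical, not the RACF monitoring code.

# Minimal sketch of a fault-tolerant farm poller: sample each node's load
# average, skip nodes that do not answer, and append to a persistent log.
# Node names, the metric and the log file are hypothetical examples.
import subprocess
import time

NODES = ["node%04d" % i for i in range(4)]   # hypothetical node names
LOGFILE = "farm_load.log"

def poll_node(node):
    """Return the node's 1-minute load average, or None if unreachable."""
    try:
        out = subprocess.run(
            ["ssh", "-o", "ConnectTimeout=5", node, "cat", "/proc/loadavg"],
            capture_output=True, text=True, timeout=10, check=True).stdout
        return float(out.split()[0])
    except Exception:
        return None   # fault tolerance: one dead node must not stop the poller

def poll_once():
    now = int(time.time())
    with open(LOGFILE, "a") as log:   # persistency: append, never overwrite
        for node in NODES:
            load = poll_node(node)
            log.write("%d %s %s\n" % (now, node, "NA" if load is None else load))

if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(60)   # near real-time: one sample per minute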
Mass Storage Monitoring
Central Disk Storage Monitoring
Linux Farm Monitoring
Temperature Monitoring
Security & Authentication
Two layers of firewall with limited network services and limited interactive access through secure gateways.
Migration to Kerberos 5 single sign-on and consolidation of password databases. NIS passwords to be phased out.
Integration of K5/AFS with LSF to solve credential forwarding issues. A similar implementation will be needed for Condor (see the wrapper sketch after this list).
Implemented a Kerberos certificate authority.
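A minimal sketch of the kind of wrapper such an integration needs: renew the Kerberos 5 ticket from a keytab and obtain an AFS token before the batch payload runs. The keytab path and principal are hypothetical, and this is not the actual RACF LSF/Condor integration.

# Minimal sketch of a batch job wrapper that refreshes Kerberos 5 and AFS
# credentials before running the real payload.  The keytab path and the
# principal are hypothetical examples.
import subprocess
import sys

KEYTAB = "/path/to/user.keytab"        # hypothetical keytab
PRINCIPAL = "user@EXAMPLE.REALM"       # hypothetical principal

def renew_credentials():
    # Obtain a Kerberos 5 TGT non-interactively from the keytab ...
    subprocess.run(["kinit", "-k", "-t", KEYTAB, PRINCIPAL], check=True)
    # ... and turn it into an AFS token for the local cell.
    subprocess.run(["aklog"], check=True)

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: krb_wrapper.py <command> [args...]")
    renew_credentials()
    # Run the command the batch system was asked to execute.
    sys.exit(subprocess.call(sys.argv[1:]))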
Future Developments
HSI/HTAR deployment for UNIX-like access to HPSS (usage sketch after this list).
Moving beyond the NFS-served SAN to more scalable solutions (Panasas, IBRIX, Lustre, NFS v4.1, etc.).
dCache/SRM being evaluated as a distributed storage management solution to exploit high-capacity, low-cost local storage in the 1300+ node Linux Farm.
Linux Farm OS upgrade plans (RHEL?).
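For illustration, a small sketch of HSI/HTAR usage for UNIX-like HPSS access, wrapped in Python for consistency with the other examples; the HPSS paths and local directory are hypothetical.

# Minimal sketch of UNIX-like HPSS access via HSI/HTAR.
# HPSS paths and the local run directory are hypothetical examples.
import subprocess

# List an HPSS directory much like a local "ls".
subprocess.call(["hsi", "ls", "/hpss/user/rundata"])

# Bundle a directory of small local files into one archive in HPSS, which is
# far friendlier to the tape system than many small individual transfers.
subprocess.call(["htar", "-cvf", "/hpss/user/rundata/run1234.tar", "run1234/"])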
US ATLAS Grid Testbed
[Diagram: US ATLAS Grid Testbed. Grid job requests arrive over the Internet from Globus clients at a gatekeeper/job manager (atlas02) feeding a Condor pool; storage elements include HPSS with a mover node (amds), 17 TB of disk, a 70 MB/s data path, and a GridFTP server (aftpexp00); supporting services include an information server (giis01), an AFS server (aafs), and a Globus RLS server.]
Local Grid development currently focused on monitoring, user management and support for DC2 production activities
Summary
RHIC run very successful.
Staff levels increasing to support the growing range of computing support activities.
On-going evaluation of scalable solutions (dCache, Panasas, Condor, etc) in a distributed computing environment.
Increased activity to support upcoming ATLAS DC 2 production.