The RHIC-ATLAS Computing Facility at BNL
HEPiX – Edinburgh, May 24-28, 2004
Tony Chan
RHIC Computing Facility
Brookhaven National Laboratory
Outline
Background
Mass Storage
Central Disk Storage
Linux Farm
Monitoring
Security & Authentication
Future Developments
Summary
Background
Brookhaven National Lab (BNL) is a U.S. gov’t funded multi-disciplinary research laboratory.
RACF formed in the mid-90’s to address computing needs of RHIC experiments. Became U.S. Tier 1 Center for ATLAS in late 90’s.
RACF supports HENP and HEP scientific computing efforts and various general services (backup, e-mail, web, off-site data transfer, Grid, etc).
Background (continued)
Currently 29 staff members (4 new hires in 2004).
RHIC Year 4 just concluded. Performance surpassed all expectations.
Staff Growth at the RACF
[Chart: RACF staff levels (FTE) by year, 1997-2004.]
RACF Structure
Mass Storage
4 StorageTek tape silos managed via HPSS (v 4.5).
Using 37 9940B drives (200 GB/tape).
Aggregate bandwidth up to 700 MB/s.
10 data movers with 10 TB of disk.
Total over 1.5 PB of raw data in 4 years of running (capacity for 4.5 PB).
The Mass Storage System
Central Disk Storage
Large SAN served via NFS: DSTs, user home directories, and scratch space.
41 Sun servers (E450 & V480) running Solaris 8 and 9. Plan to migrate all to Solaris 9 eventually.
24 Brocade switches & 250 TB of FB RAID5 managed by Veritas.
Aggregate 600 MB/s data rate to/from Sun servers on average.
Central Disk Storage (cont.)
RHIC and ATLAS AFS cells: software repositories and user home directories.
Total of 11 AIX servers with 1.2 TB for RHIC and 0.5 TB for ATLAS.
Transarc on server side, OpenAFS on client side.
Considering OpenAFS for server side.
The Central Disk Storage System
Linux Farm
Used for mass processing of data.
1359 rack-mounted, dual-CPU (Intel) servers.
Total of 1362 kSpecInt2000.
Reliable (about 6 hardware failures per month at current farm size).
Combination of SCSI & IDE disks with aggregate of 234+ TB of local storage.
Linux Farm (cont.)
Experiments making significant use of local storage through custom job schedulers, data repository managers and rootd.
Requires significant infrastructure resources (network, power, cooling, etc).
Significant scalability challenges. Advance planning and careful design a must!
The Growth of the Linux Farm
[Chart: Linux Farm capacity in kSpecInt2000 by year, 1999-2004.]
The Linux Farm in the RACF
Linux Farm Software
Custom Red Hat 8 (RHIC) and Red Hat 7.3 (ATLAS) images.
Installed via a Kickstart server.
Support for compilers (gcc, PGI, Intel) and debuggers (gdb, Totalview, Intel).
Support for network file systems (AFS, NFS) and local data storage.
Linux Farm Batch Management
New Condor-based batch system with a custom Python front-end to replace the old batch system. Fully deployed in the Linux Farm.
Use of Condor DAGMan functionality to handle job dependencies (see the sketch after this list).
New system solves scalability problems of old system.
Upgraded to Condor 6.6.5 (latest stable release) to implement advanced features (queue priority and preemption).
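As an illustration of the kind of Python front-end and DAGMan usage described above, the following minimal sketch generates Condor submit files and a DAG for a two-step job dependency; the executables, file names, and run identifier are hypothetical and this is not the actual RACF front-end.

#!/usr/bin/env python
# Minimal sketch: a Python front-end that writes Condor submit description
# files and a DAGMan DAG for a two-step dependency (reconstruction before
# analysis), then hands the DAG to condor_submit_dag.
# All executables, file names and the run identifier are hypothetical.
import subprocess

SUBMIT_TEMPLATE = """universe   = vanilla
executable = {exe}
arguments  = {args}
output     = {name}.out
error      = {name}.err
log        = {name}.log
queue
"""

def write_submit(name, exe, args):
    """Write a Condor submit description file and return its path."""
    path = name + ".sub"
    with open(path, "w") as f:
        f.write(SUBMIT_TEMPLATE.format(exe=exe, args=args, name=name))
    return path

def submit_chain(run):
    reco = write_submit("reco_" + run, "/usr/local/bin/reco.sh", run + ".raw")
    ana = write_submit("ana_" + run, "/usr/local/bin/ana.sh", run + ".dst")

    # DAGMan enforces the dependency: ANA starts only after RECO succeeds.
    dag = run + ".dag"
    with open(dag, "w") as f:
        f.write("JOB RECO " + reco + "\n")
        f.write("JOB ANA " + ana + "\n")
        f.write("PARENT RECO CHILD ANA\n")

    subprocess.call(["condor_submit_dag", dag])

if __name__ == "__main__":
    submit_chain("run1234")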
Linux Farm Batch Management (cont.)
LSF v5.1 widely used in the Linux Farm, especially for data analysis jobs. Peak rate of 350 K jobs/week.
LSF possibly to be replaced by Condor if the latter can scale to similar peak job rates. Current Condor peak rate is 7 K jobs/week.
Condor and LSF accept jobs through Globus (see the sketch after this list).
Condor scalability to be tested in ATLAS DC 2.
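For illustration, a minimal sketch of pushing a grid job to the farm through Globus GRAM (Globus Toolkit 2 style); the gatekeeper hostname is a hypothetical placeholder, and "jobmanager-condor" selects the Condor-backed job manager.

# Minimal sketch: submit a trivial grid job through a Globus gatekeeper to
# the Condor-backed job manager.  The hostname is a hypothetical placeholder.
import subprocess

GATEKEEPER = "gatekeeper.example.gov"   # hypothetical gatekeeper host

subprocess.call(["globus-job-run",
                 GATEKEEPER + "/jobmanager-condor",
                 "/bin/hostname"])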
Condor Usage at the RACF
Monitoring
Mix of open-source, RCF-designed and vendor-provided monitoring software (a sketch of the polling pattern follows this list).
Persistency and fault-tolerant features.
Near real-time information.
Scalability requirements.
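As a sketch of the polling pattern behind such custom monitors, the fragment below collects load averages from a set of farm nodes, tolerates unreachable nodes, and appends results to a persistent log; the node names, metric, and log file are hypothetical, not the RACF monitoring code.

# Minimal sketch of a fault-tolerant farm poller: sample each node's load
# average, skip nodes that do not answer, and append to a persistent log.
# Node names, the metric and the log file are hypothetical examples.
import subprocess
import time

NODES = ["node%04d" % i for i in range(4)]   # hypothetical node names
LOGFILE = "farm_load.log"

def poll_node(node):
    """Return the node's 1-minute load average, or None if unreachable."""
    try:
        out = subprocess.run(
            ["ssh", "-o", "ConnectTimeout=5", node, "cat", "/proc/loadavg"],
            capture_output=True, text=True, timeout=10, check=True).stdout
        return float(out.split()[0])
    except Exception:
        return None   # fault tolerance: one dead node must not stop the poller

def poll_once():
    now = int(time.time())
    with open(LOGFILE, "a") as log:   # persistency: append, never overwrite
        for node in NODES:
            load = poll_node(node)
            log.write("%d %s %s\n" % (now, node, "NA" if load is None else load))

if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(60)   # near real-time: one sample per minute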
Mass Storage Monitoring
Central Disk Storage Monitoring
Linux Farm Monitoring
Temperature Monitoring
Security & Authentication
Two layers of firewall with limited network services and limited interactive access through secure gateways.
Migration to Kerberos 5 single sign-on and consolidation of password databases. NIS passwords to be phased out.
Integration of K5/AFS with LSF to solve credential forwarding issues. A similar implementation will be needed for Condor (see the wrapper sketch after this list).
Implemented a Kerberos certificate authority.
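A minimal sketch of the kind of wrapper such an integration needs: renew the Kerberos 5 ticket from a keytab and obtain an AFS token before the batch payload runs. The keytab path and principal are hypothetical, and this is not the actual RACF LSF/Condor integration.

# Minimal sketch of a batch job wrapper that refreshes Kerberos 5 and AFS
# credentials before running the real payload.  The keytab path and the
# principal are hypothetical examples.
import subprocess
import sys

KEYTAB = "/path/to/user.keytab"        # hypothetical keytab
PRINCIPAL = "user@EXAMPLE.REALM"       # hypothetical principal

def renew_credentials():
    # Obtain a Kerberos 5 TGT non-interactively from the keytab ...
    subprocess.run(["kinit", "-k", "-t", KEYTAB, PRINCIPAL], check=True)
    # ... and turn it into an AFS token for the local cell.
    subprocess.run(["aklog"], check=True)

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: krb_wrapper.py <command> [args...]")
    renew_credentials()
    # Run the command the batch system was asked to execute.
    sys.exit(subprocess.call(sys.argv[1:]))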
Future Developments
HSI/HTAR deployment for UNIX-like access to HPSS (usage sketch after this list).
Moving beyond the NFS-served SAN to more scalable solutions (Panasas, IBRIX, Lustre, NFS v4.1, etc.).
dCache/SRM being evaluated as a distributed storage management solution to exploit high-capacity, low-cost local storage in the 1300+ node Linux Farm.
Linux Farm OS upgrade plans (RHEL?).
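For illustration, a small sketch of HSI/HTAR usage for UNIX-like HPSS access, wrapped in Python for consistency with the other examples; the HPSS paths and local directory are hypothetical.

# Minimal sketch of UNIX-like HPSS access via HSI/HTAR.
# HPSS paths and the local run directory are hypothetical examples.
import subprocess

# List an HPSS directory much like a local "ls".
subprocess.call(["hsi", "ls", "/hpss/user/rundata"])

# Bundle a directory of small local files into one archive in HPSS, which is
# far friendlier to the tape system than many small individual transfers.
subprocess.call(["htar", "-cvf", "/hpss/user/rundata/run1234.tar", "run1234/"])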
US ATLAS Grid Testbed
[Diagram: US ATLAS Grid Testbed. Grid job requests arrive over the Internet from Globus clients at a gatekeeper/job manager (atlas02) feeding a Condor pool; storage elements include HPSS with a mover node (amds), 17 TB of disk, a 70 MB/s data path, and a GridFTP server (aftpexp00); supporting services include an information server (giis01), an AFS server (aafs), and a Globus RLS server.]
Local Grid development currently focused on monitoring, user management and support for DC2 production activities
Summary
RHIC run very successful.
Staff levels increasing to support the growing range of computing support activities.
On-going evaluation of scalable solutions (dCache, Panasas, Condor, etc) in a distributed computing environment.
Increased activity to support upcoming ATLAS DC 2 production.