INFN – Tier1 Site Status Report
Vladimir Sapunenko on behalf of Tier1 staff
HEPiX, CERN, 6-May-08
Overview
• Introduction
• Infrastructural Expansion
• Farming
• Network
• Storage and databases
• Conclusions
Introduction
Location: INFN - CNAF, Bologna; floor -2, a ~1000 m2 hall in the basement.
• Multi-experiment Tier-1 (~20 VOs, including the LHC experiments, CDF, BABAR and others)
• Participating in the LCG, EGEE and INFNGRID projects
• One of the main nodes of the GARR network
In a nutshell:
• about 3 MSI2K with ~2000 CPUs/cores, to be expanded to ~9 MSI2K by June
• about 1 PB of disk space (tender for a further 1.6 PB)
• 1 PB tape library (additional 10 PB tape library by Q2 2008)
• Gigabit Ethernet network, with some 10 Gigabit links on the LAN and WAN
Resources are assigned to experiments on a yearly basis.
Infrastructural Expansion
Expansion of the electrical power and cooling systems; work is in progress right now.
Main constraints:
• Dimensional - limited room height (hmax = 260 cm) → no floating floor
• Environmental - noise and electromagnetic insulation required due to the proximity of offices and classrooms
Key aspects: reliability, redundancy, maintainability
• 2 (+1) transformers: 2500 kVA each (~2000 kW each)
• 2 rotating (no-break) electric generators (EG), i.e. UPS+EG in one machine to save room: 1700 kVA each (~1360 kW each), to be integrated with the 1250 kVA EG already installed
• 2 independent electric lines feeding each row of racks: 2n redundancy
• 7 chillers: ~2 MW at Tair = 40°C; 2 (+2) pumps for chilled water circulation
• High-capacity precision air conditioning units (50 kW each) inside the high-density islands (APC)
• Air treatment and conditioning units (30 kW each) outside the high-density islands: UTA and UTL
[Slide: floor plan] The TIER1 in 2009 (floor -2): room 1 (migration, then storage), room 2 (farming), the electric panelboard room, sites for chilled water piping from/to floor -1 (chillers), and sites not involved in the expansion. Remote control of the systems' critical points.
Farming
The farming service maintains all computing resources and provides access to them. Main aspects:
• Automatic unattended installation of all nodes via Quattor
• Advanced job scheduling via the LSF scheduler
• Customized monitoring and accounting system via RedEye, developed in-house
• Remote control of all computing resources via KVM and IPMI, plus customized scripts
Resources can be accessed in 2 ways:
• GRID: the preferred solution, using a so-called "User Interface" node. Requires a VO to be set up. Secure: X.509 certificates are used for authentication/authorization.
• Direct access to LSF: discouraged. Faster (a plain UNIX account on a front-end machine is enough), but limited to the Tier1 only and insecure.
Node installation and configuration
Quattor (www.quattor.org) is a CERN-developed toolkit for automatic unattended installation:
• Kickstart handles the initial installation (RedHat); Quattor takes over after the first reboot
• Nodes are configured according to administrator requirements
• Very powerful: allows per-node customizations, but can also easily install 1000 nodes with the same configuration in 1 hour (see the sketch after this slide)
• Currently we only support Linux
The HEP community chose Scientific Linux (www.scientificlinux.org):
• Version currently deployed at CNAF: 4.x, identical to RedHat AS
• Good hardware support, big software library available on-line
• Supported Grid middleware is gLite 3.1
• We install the SL CERN release (www.cern.ch/linux) with some useful customizations
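The following is a conceptual sketch, in Python rather than Quattor's actual Pan template language, of the idea described above: every node starts from a shared base profile and only a few receive per-node overrides. All hostnames and profile keys are hypothetical.

```python
# Conceptual sketch only: Quattor itself uses Pan templates, not Python.
# It illustrates "per-node customizations on top of a common configuration".

BASE_PROFILE = {
    "os": "Scientific Linux 4.x",
    "middleware": "gLite 3.1",
    "lsf_client": True,
}

# Most of the worker nodes need no entry here at all.
NODE_OVERRIDES = {
    "wn-042.cnaf.infn.it": {"lsf_client": False, "role": "test"},  # hypothetical node
}

def build_profile(hostname: str) -> dict:
    """Merge the shared base profile with any per-node customization."""
    profile = dict(BASE_PROFILE)
    profile.update(NODE_OVERRIDES.get(hostname, {}))
    return profile

if __name__ == "__main__":
    print(build_profile("wn-001.cnaf.infn.it"))  # plain worker node
    print(build_profile("wn-042.cnaf.infn.it"))  # customized node
```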
Job scheduling
Job scheduling is done via the LSF scheduler: 848 WNs, 2032 CPUs/cores, 3728 slots.
Queue abstraction:
• One job queue per VO is deployed; there are no time-oriented queues
• Each experiment submits jobs to its own queue only
• Resource utilization limits are set on a per-queue basis
Hierarchical fairshare scheduling is used to calculate the priority in resource access:
• All slots are shared; there are no VO-dedicated resources and all nodes belong to a single big cluster
• One group per VO, with subgroups supported
• A share (namely a resource quota) is assigned to each group in a hierarchical way
• Priority is directly proportional to the share and inversely proportional to the historical resource usage (illustrated below)
MPI jobs are supported.
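A minimal illustration of the proportionality stated above; this is not LSF's actual fairshare formula, and the VO names, shares and usage figures are hypothetical.

```python
# Higher assigned share and lower past usage -> higher dynamic priority.
historical_usage = {"atlas": 3200.0, "cms": 1800.0, "lhcb": 400.0}  # e.g. CPU-hours (hypothetical)
assigned_share   = {"atlas": 40,     "cms": 40,     "lhcb": 20}      # quota per VO group (hypothetical)

def dynamic_priority(group: str) -> float:
    """Simplified: priority proportional to share, inversely proportional to usage."""
    return assigned_share[group] / (1.0 + historical_usage[group])

for vo in sorted(assigned_share, key=dynamic_priority, reverse=True):
    print(f"{vo}: priority {dynamic_priority(vo):.5f}")
```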
[Slide: plot] CNAF Tier1 KSpecInt2000 history - declared/available KSI2K monitoring from Mar-07 to Jul-08 (y-axis 0-10000 KSI2K). Annotations: before migration, 848 WNs, 2032 CPUs/cores, 3728 slots; after migration, 452 WNs, 1380 CPUs/cores, 2230 slots; 11 twin quad-core servers added, giving 476 WNs, 540 CPUs/cores, 2450 slots; expected delivery (from the new tender) of 6312.32 KSI2K.
[Slide: diagram] LHC Network General Layout. WAN side: GARR reached via a 7600 router (2x10 Gb/s); a dedicated 10 Gb/s LHC-OPN link; a 10 Gb/s general-purpose link carrying T1-T1 traffic (except FZK), T1-T2 traffic and CNAF general-purpose traffic; a 10 Gb/s LHC-OPN CNAF-FZK & T0-T1 backup link. LAN side: Extreme BD10808 and BD8810 core switches; worker nodes attached to Extreme Summit 450/400 switches, with 2x1 Gb/s, 4x1 Gb/s and 2x10 Gb/s links; storage servers (disk servers, CASTOR stagers) connected via Fibre Channel through an FC director to the SAN storage devices. In case of network congestion: uplink upgrade from 4x1 Gb/s to 10 Gb/s or 2x10 Gb/s.
Storage @ CNAF
Implementation of the 3 Storage Classes needed for LHC (their behaviour is sketched in the toy model below):
• Disk0 Tape1 (D0T1): CASTOR (testing GPFS/TSM/StoRM). Space is managed by the system; data are migrated to tape and deleted from disk when the staging area is full.
• Disk1 Tape0 (D1T0): GPFS/StoRM. Space is managed by the VO. Used by CMS, LHCb, ATLAS.
• Disk1 Tape1 (D1T1): CASTOR (moving to GPFS/TSM/StoRM). Space is managed by the VO (i.e. if the disk is full, the copy fails). A large disk buffer with a tape back end and no garbage collector.
Deployment of an Oracle database infrastructure for the back-ends of Grid applications.
Advanced backup service for both disk-based and database data: Legato, RMAN, TSM (in the near future).
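A toy Python model of the three storage classes listed above, only to make their space-management semantics explicit; the policy logic is illustrative, not the actual CASTOR/StoRM implementation.

```python
from dataclasses import dataclass

@dataclass
class StorageClass:
    name: str
    disk_copies: int         # the "Disk N" part of the name
    tape_copies: int         # the "Tape N" part of the name
    vo_managed: bool         # True: the VO manages space (a copy fails when disk is full)
    garbage_collected: bool  # True: the system frees disk when the staging area is full

D0T1 = StorageClass("D0T1", disk_copies=0, tape_copies=1, vo_managed=False, garbage_collected=True)
D1T0 = StorageClass("D1T0", disk_copies=1, tape_copies=0, vo_managed=True,  garbage_collected=False)
D1T1 = StorageClass("D1T1", disk_copies=1, tape_copies=1, vo_managed=True,  garbage_collected=False)

def on_disk_full(sc: StorageClass) -> str:
    """What happens when the disk area fills up, per the slide's description."""
    if sc.garbage_collected:
        return "migrate cold files to tape and delete them from disk"
    return "refuse new copies until the VO frees space"

for sc in (D0T1, D1T0, D1T1):
    print(sc.name, "->", on_disk_full(sc))
```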
CASTOR deployment
• ~40 disk servers attached to a SAN with full redundancy: FC 2 Gb/s or 4 Gb/s connections (dual-controller hardware and Qlogic SANsurfer Path Failover software, or vendor-specific software)
• Disk arrays: STK FlexLine 600, IBM FastT900
• Core services run on machines with SCSI disks, hardware RAID1 and redundant power supplies
• Tape servers and disk servers have lower-level hardware, like the WNs
• 15 tape servers
• ACSLS 7.0 runs on a Sun Blade v100 (Solaris 9.0) with 2 internal IDE disks in software RAID1
• STK L5500 silos (5500 slots, 200 GB cartridges, capacity ~1.1 PB)
• 16 tape drives; 3 Oracle databases (DLF, Stager, Nameserver)
• LSF plug-in for scheduling
• SRM v2 (2 front-ends), SRM v1 (phasing out)
Storage evolution
Previous tests demonstrated weaknesses in CASTOR's behaviour; even if some issues are now solved, we want to investigate and deploy an alternative way of implementing the D1T1 and D0T1 storage classes.
Great expectations come from the use of TSM together with GPFS and StoRM.
Ongoing GPFS/TSM/StoRM integration tests:
• StoRM needs to be modified to support DxT1; some non-trivial modifications are required for D0T1
• A short-term solution for D1T1, based on customized scripts, has been successfully tested by LHCb
• The solution for D0T1 is much more complicated and is currently under test (a sketch of the kind of migration policy involved follows below)
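The sort of policy such a D0T1 setup could follow is sketched below, assuming a GPFS staging area with an HSM (e.g. TSM) behind it; the path, watermarks and hsm_* helpers are hypothetical placeholders, not a real TSM API.

```python
import os

STAGING_AREA = "/gpfs/d0t1/staging"   # hypothetical path
HIGH_WATERMARK = 0.90                 # start releasing above 90% usage
LOW_WATERMARK = 0.75                  # stop once usage drops below 75%

def disk_usage(path: str) -> float:
    """Fraction of the file system that is currently used."""
    st = os.statvfs(path)
    return 1.0 - st.f_bavail / st.f_blocks

def hsm_is_migrated(path: str) -> bool:
    # Placeholder: a real setup would ask the HSM whether a tape copy exists.
    return True

def hsm_release(path: str) -> None:
    # Placeholder: a real setup would drop the disk copy, keeping only the stub.
    print("releasing", path)

def release_cold_files() -> None:
    """When the staging area is nearly full, free disk space for files
    that already have a tape copy, least recently used first."""
    if disk_usage(STAGING_AREA) < HIGH_WATERMARK:
        return
    files = sorted(
        (os.path.join(dp, f) for dp, _, fs in os.walk(STAGING_AREA) for f in fs),
        key=os.path.getatime,
    )
    for path in files:
        if disk_usage(STAGING_AREA) < LOW_WATERMARK:
            break
        if hsm_is_migrated(path):
            hsm_release(path)
```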
Why StoRM and GPFS/TSM?
StoRM is a Grid-enabled Storage Resource Manager (SRM v2.2) that allows Grid applications to interact with storage resources through standard POSIX calls (see the sketch below).
GPFS 3.2 is IBM's high-performance cluster file system:
• Greatly reduced administrative overhead
• Redundancy at the level of I/O server failure
• HSM support and ILM features in both GPFS and TSM permit the creation of a very efficient solution
• GPFS in particular has demonstrated robustness and high performance, and showed better performance in a SAN environment when compared to the CASTOR, dCache and Xrootd solutions
• Long experience at CNAF (>3 years): ~27 GPFS file systems in production (~720 net TB), mounted on all farm WNs
TSM is a high-performance backup/archiving solution from IBM; TSM 5.5 implements HSM and is also used in the HEP world (e.g. FZK, NDGF, and CERN for backup).
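A small illustration of the POSIX-access point above: once the StoRM-managed space is visible as a GPFS mount on the worker nodes, a job can read and write its data with ordinary file I/O, with no storage-specific client library involved. The path below is hypothetical.

```python
DATA_FILE = "/gpfs/atlas/user/example/output.dat"  # hypothetical GPFS path

# Write with plain POSIX I/O ...
with open(DATA_FILE, "wb") as f:
    f.write(b"detector data ...")

# ... and read it back the same way.
with open(DATA_FILE, "rb") as f:
    payload = f.read()

print(len(payload), "bytes read via plain POSIX calls")
```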
GPFS deployment evolution
• Started from a single cluster: all WNs and I/O nodes in one cluster
• Some manageability problems were observed
• The cluster of servers was then separated from the one of WNs
• Access to a remote cluster's file system has proven to be as efficient as access to a local one
• Decided to also separate the cluster with the HSM back end
Oracle Database Service
Main goals: high availability, scalability, reliability. Achieved through a modular architecture based on the following building blocks:
• Oracle ASM for storage management: implementation of redundancy and striping in an Oracle-oriented way
• Oracle Real Application Cluster (RAC): the database is shared across several nodes with failover and load-balancing capabilities
• Oracle Streams: geographical data redundancy
Current deployment: 32 servers, 19 of them configured in 7 clusters; 40 database instances; storage: 5 TB (20 TB raw).
Availability rate: 98.7% in 2007, where Availability (%) = Uptime / (Uptime + Target Downtime + Agent Downtime), as illustrated below.
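A quick check of the availability formula above; the uptime and downtime figures used here are hypothetical (the actual 2007 numbers are not given in the slides), chosen only to reproduce a ~98.7% result.

```python
def availability(uptime_h: float, target_downtime_h: float, agent_downtime_h: float) -> float:
    """Availability (%) = Uptime / (Uptime + Target Downtime + Agent Downtime)."""
    return 100.0 * uptime_h / (uptime_h + target_downtime_h + agent_downtime_h)

# e.g. ~8646 hours up, 80 h target downtime, 34 h agent downtime in a year (hypothetical)
print(f"{availability(8646, 80, 34):.1f}%")  # -> 98.7%
```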
Backup
• At present, backup to tape is based on Legato Networker 3.3
• Database on-line backup through RMAN; one copy is also stored on tape via the Legato-RMAN plug-in
• A future migration to IBM TSM is foreseen
• Certified interoperability between GPFS and TSM
• TSM provides not only backup and archiving methods but also migration capabilities
• It is possible to exploit TSM migration in order to implement the D1T1 and D0T1 storage classes (StoRM/GPFS/TSM integration)
Conclusions
INFN – Tier1 is facing a big infrastructural improvement which will allow it to fully meet the experiments' requirements for LHC.
Farming and network services are already well consolidated and are able to grow in terms of computing capacity and network bandwidth without deep structural modifications.
The storage service has achieved a good degree of stability; the remaining issues are mainly due to the implementation of the D0T1 and D1T1 storage classes. An integration between StoRM, GPFS and TSM is under development and promises to be a definitive solution for the outstanding problems.