5 Sept 2006, GDB meeting BNL, Milos Lokajicek: Service planning and monitoring in T2 - Prague

Overview

• Introduction
• Service planning and current status
  – Capacities
  – Networking
  – Personnel
• Monitoring
  – HW and SW
  – Middleware
  – Service

• Remarks

Introduction

• Czech Republic’s LHC activities
  – ATLAS, target 3% of authors -> activities
  – ALICE, target 1%
  – TOTEM, a much smaller experiment, relative target higher
  – (non-LHC: HERA/H1, TEVATRON/D0, AUGER)
• Institutions (mentioning just the big groups)
  – Academy of Sciences of the Czech Republic
    • Institute of Physics
    • Nuclear Physics Institute
  – Charles University in Prague
    • Faculty of Mathematics and Physics
  – Czech Technical University in Prague
    • Faculty of Nuclear Sciences and Physical Engineering
• HEP manpower (2005)
  – 145 people
    • 59 physicists
    • 22 engineers
    • 21 technicians
    • 43 undergraduate and PhD students

Service planning

ATLAS + ALICE     2007    2008    2009    2010
CPU (MSI2000)      0.4     1.9     2.9     4.7
Disk (TB)          202     904   1 420   2 354
MSS (TB)           120     562   1 013   1 652

• Table based on the LCG MoU for ATLAS and ALICE and our anticipated share
• Project proposals to various grant systems in the Czech Republic
• Preparing a bigger project proposal for CZ GRID together with CESNET
  – For the LHC needs
  – In 2010 add 3x more capacity for Czech non-HEP scientists, financed from state resources and EU structural funds
• All proposals include new personnel (up to 10 new persons)
• Today, regular financing, sufficient for D0
  – Today 250 cores, 150 kSI2k, 40 TB disk space, no tapes
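To put the planned figures in perspective, the sketch below computes the growth factor needed to go from the capacity installed today to each planned year. It is only an illustration: it assumes the current farm (150 kSI2k, 40 TB) is the baseline, which is a simplification because that capacity is sized for D0. The names PLAN, TODAY, growth_factor and growth_vs_today are made up for this example.

```python
# Planned ATLAS + ALICE capacities, copied from the table above.
PLAN = {
    2007: {"cpu_msi2k": 0.4, "disk_tb": 202, "mss_tb": 120},
    2008: {"cpu_msi2k": 1.9, "disk_tb": 904, "mss_tb": 562},
    2009: {"cpu_msi2k": 2.9, "disk_tb": 1420, "mss_tb": 1013},
    2010: {"cpu_msi2k": 4.7, "disk_tb": 2354, "mss_tb": 1652},
}

# Capacity quoted for today: 150 kSI2k (= 0.150 MSI2000), 40 TB disk, no tapes.
# Using it as the baseline is an assumption made for illustration only.
TODAY = {"cpu_msi2k": 0.150, "disk_tb": 40, "mss_tb": 0}


def growth_factor(current, target):
    """Return target/current, or None when there is no current capacity."""
    if current == 0:
        return None
    return target / current


def growth_vs_today(year):
    """Growth factors needed to reach the plan for `year` from today's capacity."""
    return {key: growth_factor(TODAY[key], PLAN[year][key]) for key in TODAY}


if __name__ == "__main__":
    for year in sorted(PLAN):
        factors = growth_vs_today(year)
        printable = {
            key: ("n/a" if value is None else round(value, 1))
            for key, value in factors.items()
        }
        print(year, printable)
```

Tape (MSS) growth is reported as "n/a" because there is no tape capacity today, so no ratio can be formed.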

Networking

• Local connection of institutes in Prague
  – Optical 1 Gbps E2E lines
• WAN
  – Optical E2E lines to Fermilab and Taipei, new line to FZK (from 1 Sept 06)
  – Connection Prague – Amsterdam now through GN2
  – Planning further lines to other T1s

(Figure source: Sima @ CEF Networks workshop, Prague, May 30th, 2006)

Personnel

• Now 4 persons to run T2
  – Jiri Kosina – middleware (leaving, looking for replacement), storage (FTS), monitoring
  – Tomas Kouba – middleware, monitoring
  – Jan Svec – basic HW, OS, storage, networking, monitoring
  – Lukas Fiala – basic HW, networking, web services
  – Jiri Chudoba – liaison to ATLAS and ALICE, running the jobs and reporting errors, service monitoring

• Further information is based on their experience

Monitoring

• HW and basic SW
  – Installation and test of new hardware
    • Normally choose proven HW
    • HW installation by the delivery firm
    • Install the operating system and solve problems with the delivery firm
    • Install middleware
    • Test it for some time outside the production service
  – Nagios
    • Worker node reachability via ping
    • Disks: how full the partitions are
    • Load average
    • Whether the pbs_mom process is running
    • Number of running processes
    • Whether the ssh daemon is running
    • How full the swap is
    • ...
    • Limits for warning and error
    • Distribution of mails or SMS to admins, fixing problems remotely
    • Regular check of the Nagios web page for red dots
  – Regular automatic (cron) checks and restarts for some daemons
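The "limits for warning and error" idea can be sketched as a small Nagios-style check for partition fullness. This is a minimal illustration, not the site's actual plugin: the function names and the 80%/90% thresholds are assumptions. The only external convention relied on is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a Nagios state (thresholds are assumed values)."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_partition(path, warn=80.0, crit=90.0):
    """Return (exit_code, message) for the partition that contains `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path}: reported size is zero"
    used_percent = 100.0 * usage.used / usage.total
    state = classify(used_percent, warn, crit)
    return state, f"DISK {LABELS[state]} - {path} is {used_percent:.1f}% full"


def main(argv):
    """Check the path given as first argument (default "/") and return the exit code."""
    target = argv[1] if len(argv) > 1 else "/"
    code, message = check_partition(target)
    print(message)
    return code
```

To use it as a plugin, a wrapper script would call `sys.exit(main(sys.argv))`, so that Nagios sees the exit code and the single status line on standard output.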

Monitoring

• PBS – job count (via RRD and MRTG)
  – Local tools for monitoring the number of jobs per machine over a chosen period
• APEL – not very useful as is, might be set up to give more useful info
• GridICE
• ATLAS
  – Checks and statistics from the ATLAS database
• ALICE – MonALISA – very useful
• Monitoring of pool accounts and actual user certificates
• Networking
  – Network traffic to FZK, SARA, CERN in a certain IP range
  – With the help of IP accounting (utility ipac-ng)
    http://golias100.farm.particle.cz/ipac/
• SFT – site functional tests – very useful
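A local tool that counts jobs per machine over a chosen period could look roughly like the sketch below. It is not the Prague tool: the sample lines are invented, and the semicolon-separated layout with an exec_host field is assumed here to resemble PBS/Torque accounting records. The function names are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Invented sample records; the layout (timestamp;type;job id;fields) is an assumption.
SAMPLE_LOG = """\
09/01/2006 10:00:00;E;101.server;user=atlas001 exec_host=node01/0 Exit_status=0
09/01/2006 11:30:00;E;102.server;user=alice001 exec_host=node02/0 Exit_status=0
09/02/2006 09:15:00;E;103.server;user=atlas002 exec_host=node01/1 Exit_status=1
09/03/2006 08:00:00;S;104.server;user=atlas001 exec_host=node03/0
"""


def parse_line(line):
    """Return (timestamp, record_type, hosts), or None for malformed lines."""
    parts = line.strip().split(";", 3)
    if len(parts) != 4:
        return None
    stamp, record_type, _job_id, message = parts
    try:
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    hosts = set()
    for field in message.split():
        if field.startswith("exec_host="):
            # exec_host looks like "node01/0+node01/1"; keep only the host names.
            for slot in field[len("exec_host="):].split("+"):
                hosts.add(slot.split("/")[0])
    return when, record_type, hosts


def jobs_per_machine(lines, start, end):
    """Count finished jobs (record type E) per host with start <= time < end."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, record_type, hosts = parsed
        if record_type == "E" and start <= when < end:
            counts.update(hosts)
    return counts
```

For example, `jobs_per_machine(SAMPLE_LOG.splitlines(), datetime(2006, 9, 1), datetime(2006, 9, 3))` counts the finished jobs on each node during the first two days of September; such counts could then be fed to RRD or plotted.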

[Figure: outgoing network traffic graphs]
Outgoing to fzk1: max 37M, average 6M, total 129G
Outgoing to internet: max 61M, average 8M, total 164G

Updates and patches

• YAIM + automated updates on all farm nodes using the simple BEX script toolkit
  – Takes care of upgrading a node that was switched off during the deployment/upgrade phase ... keeps all nodes in sync automatically
  – ftp://atrey.karlin.mff.cuni.cz/pub/local/mj/linux/bex-2.0.tar.gz, info in the README file

Service monitoring

• Using the checks described above and their combinations
• Relying on useful monitors supported centrally or by the experiments
• We would appreciate receiving an early warning if jobs on some site or worker nodes start to fail quickly after submission
• Service requirements for T2s in “extended” working hours
  – No special plan today
  – Try to provide an architecture in which the responsible people can even travel and do as much as possible remotely (e.g. network console access)
  – Future computing capacities will probably require new arrangements
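The early-warning wish above (jobs on a site or node failing quickly after submission) can be illustrated with a simple heuristic. This is a sketch under stated assumptions, not an existing monitor: the input format, the function name suspicious_nodes and all thresholds (60 s, at least 5 jobs, 80% quick failures) are invented for the example.

```python
from collections import defaultdict


def suspicious_nodes(jobs, max_runtime_s=60, min_jobs=5, min_fail_fraction=0.8):
    """Return nodes where most jobs failed shortly after starting.

    `jobs` is an iterable of (node, runtime_seconds, exit_status) tuples.
    A job counts as a quick failure when it exits non-zero within
    `max_runtime_s` seconds. All thresholds are assumed, illustrative values.
    """
    total = defaultdict(int)
    quick_failures = defaultdict(int)
    for node, runtime_s, exit_status in jobs:
        total[node] += 1
        if exit_status != 0 and runtime_s <= max_runtime_s:
            quick_failures[node] += 1
    flagged = []
    for node, count in total.items():
        # Require a minimum number of jobs so a single failure does not raise an alarm.
        if count >= min_jobs and quick_failures[node] / count >= min_fail_fraction:
            flagged.append(node)
    return sorted(flagged)
```

A node flagged this way could be taken out of the batch system automatically or reported by mail, in the same way as the Nagios alarms; requiring a minimum job count keeps a single failed job from triggering a warning.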