5 Sept 2006, GDB meeting BNL, Milos Lokajicek: Service planning and monitoring in T2 - Prague.
5 Sept 2006 GDB meeting BNL, Milos Lokajicek
Service planning and monitoring in T2 - Prague
Overview
• Introduction
• Service planning and current status
  – Capacities
  – Networking
  – Personnel
• Monitoring
  – HW and SW
  – Middleware
  – Service
• Remarks
Introduction
• Czech Republic’s LHC activities
  – ATLAS, target 3% of authors -> activities
  – ALICE, target 1%
  – TOTEM, a much smaller experiment, relative target higher
  – (non-LHC: HERA/H1, TEVATRON/D0, AUGER)
• Institutions (mentioning just the big groups)
  – Academy of Sciences of the Czech Republic
    • Institute of Physics
    • Nuclear Physics Institute
  – Charles University in Prague
    • Faculty of Mathematics and Physics
  – Czech Technical University in Prague
    • Faculty of Nuclear Sciences and Physical Engineering
• HEP manpower (2005)
  – 145 people
    • 59 physicists
    • 22 engineers
    • 21 technicians
    • 43 undergraduate and PhD students
Service planning
ATLAS + ALICE     2007    2008    2009    2010
CPU (MSI2000)      0.4     1.9     2.9     4.7
Disk (TB)          202     904   1 420   2 354
MSS (TB)           120     562   1 013   1 652
• Table based on LCG MoU for ATLAS and ALICE and our anticipated share
• Project proposals to various grant systems in the Czech Republic
• Prepare a bigger project proposal for CZ GRID together with CESNET
  – For the LHC needs
  – In 2010 add 3x more capacity for Czech non-HEP scientists, financed from state resources and EU structural funds
• All proposals include new personnel (up to 10 new persons)
• Today, regular financing, sufficient for D0
  – today 250 cores, 150 kSI2k, 40 TB disk space, no tapes
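The planning table implies steep year-on-year growth. A minimal Python sketch of that arithmetic follows. It uses only the figures quoted above and treats today's installed farm (150 kSI2k, 40 TB) as the starting point. That starting point is an assumption, since today's capacity is sized for D0 while the table covers ATLAS + ALICE. The function and variable names are illustrative.

```python
# Planned ATLAS + ALICE capacities from the table above.
years = [2007, 2008, 2009, 2010]
cpu_msi2k = [0.4, 1.9, 2.9, 4.7]
disk_tb = [202, 904, 1420, 2354]

# Today's installed capacity as quoted on the slide (assumed baseline).
today_cpu_msi2k = 150 / 1000  # 150 kSI2k expressed in MSI2000
today_disk_tb = 40


def growth_factors(start, planned):
    """Return the ratio of each planned value to the value before it."""
    previous = [start] + planned[:-1]
    return [round(p / q, 2) for p, q in zip(planned, previous)]


cpu_growth = growth_factors(today_cpu_msi2k, cpu_msi2k)
disk_growth = growth_factors(today_disk_tb, disk_tb)

for year, c, d in zip(years, cpu_growth, disk_growth):
    print(f"{year}: CPU x{c}, disk x{d}")
```

The largest relative jumps fall in 2007 and 2008, which is where the grant proposals and new personnel would be needed first.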
Networking
• Local connection of institutes in Prague
  – Optical 1 Gbps E2E lines
• WAN
  – Optical E2E lines to Fermilab, Taipei and, newly, FZK (from 1 Sept 06)
  – Connection Prague – Amsterdam now through GN2
  – Planning further lines to other T1s
Sima @ CEF Networks workshop, Prague, May 30th, 2006
Personnel
• Now 4 persons to run T2
  – Jiri Kosina – middleware (leaving, looking for replacement), storage (FTS), monitoring
  – Tomas Kouba – middleware, monitoring
  – Jan Svec – basic HW, OS, storage, networking, monitoring
  – Lukas Fiala – basic HW, networking, web services
  – Jiri Chudoba – liaison to ATLAS and ALICE, running the jobs and reporting errors, service monitoring
• Further information is based on their experience
Monitoring
• HW and basic SW
  – Installation and test of new hardware
    • normally choose proven HW
    • HW installation by delivery firm
    • install operating system and solve problems with delivery firms
    • install middleware
    • test it for some time outside the production service
  – Nagios
    • worker nodes access via ping
    • disks: how full the partitions are
    • load average
    • whether the pbs_mom process is running
    • number of running processes
    • whether the ssh daemon is running
    • how full the swap is
    • …
    • Limits for warning and error
    • Distribution of mails or SMS to admins – fixing problems remotely
    • Regular check of the Nagios web page for red dots
  – Regular automatic (cron) checks and restarts for some daemons
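The "limits for warning and error" idea can be illustrated with a small Nagios-style check. This is a sketch, not the site's actual configuration. The 80% and 90% limits and the function names are assumptions. The numeric states follow the standard Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
import shutil

# Standard Nagios plugin states.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(value, warn, crit):
    """Map a measured value onto a Nagios state using two limits."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def disk_used_percent(path):
    """Return how full the filesystem holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (state, message) for one partition; the limits are illustrative."""
    used = disk_used_percent(path)
    state = classify(used, warn, crit)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[state]
    return state, f"DISK {label} - {path} is {used:.1f}% full"


# A real plugin would exit with `state` as its return code.
state, message = check_disk()
print(message)
```

The same `classify` helper could be reused for swap usage or load average, since each of those checks compares one number against a warning and an error limit.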
Monitoring
• PBS – job count (via RRD and mrtg)
  – Local tools for monitoring the number of jobs per machine per chosen period
• APEL – not very useful, might be set up to give more useful info
• GridICE
• ATLAS
  – Checks and statistics from ATLAS database
• ALICE – MonALISA – very useful
• Monitor pool accounts and actual user certificates
• Networking
  – Network traffic to FZK, SARA, CERN in certain IP ranges
  – With the help of IP accounting (utility ipac-ng)
    http://golias100.farm.particle.cz/ipac/
• SFT – site functional tests – very useful
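The local job-count tools are not shown in the slides. As a rough illustration of "number of jobs per machine per chosen period", the sketch below counts jobs per worker node in a time window. The record layout, node names and timestamps are hypothetical; a real tool would read them from the batch system's accounting data.

```python
from collections import Counter
from datetime import datetime

# Hypothetical job records: (worker node, job start time).
jobs = [
    ("node01", datetime(2006, 9, 1, 8, 15)),
    ("node01", datetime(2006, 9, 1, 9, 40)),
    ("node02", datetime(2006, 9, 1, 10, 5)),
    ("node02", datetime(2006, 9, 3, 14, 0)),
]


def jobs_per_node(records, start, end):
    """Count jobs per node whose start time falls in [start, end)."""
    return Counter(node for node, started in records if start <= started < end)


counts = jobs_per_node(jobs, datetime(2006, 9, 1), datetime(2006, 9, 2))
for node, n in sorted(counts.items()):
    print(node, n)
```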
[Network traffic graphs]
outgoing to fzk1: Max 37M, Average 6M, Total 129G
outgoing to internet: Max 61M, Average 8M, Total 164G
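The Max / Average / Total figures under each graph are simple aggregates over per-interval traffic counts. The sketch below shows one way to compute them. The sample values, the 5-minute interval and the units (rates in bits per second, total in bytes) are assumptions; the slides do not state how ipac-ng was configured here.

```python
def summarize(bytes_per_interval, interval_seconds):
    """Return (max rate, average rate, total bytes) for one traffic series.

    Rates are in bits per second; the total is in bytes.
    """
    rates = [8 * b / interval_seconds for b in bytes_per_interval]
    return max(rates), sum(rates) / len(rates), sum(bytes_per_interval)


# Hypothetical 5-minute byte counts for one accounting rule.
samples = [150_000_000, 300_000_000, 75_000_000, 0]
peak, average, total = summarize(samples, interval_seconds=300)
print(f"Max: {peak / 1e6:.0f}M  Average: {average / 1e6:.1f}M  Total: {total / 1e9:.3f}G")
```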
Updates and patches
• YAIM + automated updates on all farm nodes using the simple BEX script toolkit (takes care of upgrading a node that was switched off during the deployment/upgrade phase ... keeps all nodes in sync automatically)
  ftp://atrey.karlin.mff.cuni.cz/pub/local/mj/linux/bex-2.0.tar.gz, info in README file
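The catch-up behaviour described here, where a node that was off during deployment is upgraded later, can be pictured as each node applying whichever published updates it has not yet recorded. The sketch below illustrates only that idea. It does not show how BEX works internally, and the update names and data structures are hypothetical.

```python
# Hypothetical ordered list of update steps published for the farm.
updates = ["001-kernel-patch", "002-middleware-upgrade", "003-config-fix"]


def pending_updates(applied, published):
    """Return the published updates a node has not applied yet, in order."""
    done = set(applied)
    return [u for u in published if u not in done]


def bring_in_sync(node_state, published):
    """Record every missing update on each node and return what was applied."""
    log = []
    for node, applied in node_state.items():
        for update in pending_updates(applied, published):
            applied.append(update)  # a real tool would run the update here
            log.append((node, update))
    return log


# node02 was switched off while 002 and 003 were deployed.
state = {"node01": list(updates), "node02": ["001-kernel-patch"]}
log = bring_in_sync(state, updates)
print(log)
```

Running the same step again applies nothing, which is what lets it be scheduled repeatedly without side effects.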
Service monitoring
• Using the checks described above and their combinations
• Rely on useful monitors supported centrally or by the experiments
• We would appreciate receiving an early warning if jobs on some site/worker nodes start failing quickly after submission
• Service requirements for T2s in “extended” working hours
  – No special plan today
  – Try to provide an architecture in which the responsible people can even travel and do as much as possible remotely (e.g. network console access)
  – Future computing capacities will probably require new arrangements
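The early warning requested above, for jobs that fail quickly after submission, could be approximated from job records alone. The sketch below flags worker nodes where most recent jobs failed within a short runtime. The record layout, the 60-second cut-off, the minimum job count and the 50% threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical job records: (worker node, runtime in seconds, exit status).
jobs = [
    ("node01", 3600, 0),
    ("node01", 4100, 0),
    ("node02", 12, 1),
    ("node02", 8, 1),
    ("node02", 15, 1),
    ("node03", 20, 1),
    ("node03", 2900, 0),
]


def suspicious_nodes(records, max_runtime=60, min_jobs=3, threshold=0.5):
    """Return nodes where the share of quickly failing jobs exceeds `threshold`.

    A job counts as a quick failure if it exits non-zero within `max_runtime`
    seconds. Nodes with fewer than `min_jobs` jobs are ignored.
    """
    total = defaultdict(int)
    quick_fail = defaultdict(int)
    for node, runtime, status in records:
        total[node] += 1
        if status != 0 and runtime <= max_runtime:
            quick_fail[node] += 1
    return sorted(
        node
        for node, n in total.items()
        if n >= min_jobs and quick_fail[node] / n > threshold
    )


print(suspicious_nodes(jobs))
```

Requiring a minimum number of jobs keeps a single unlucky job from raising an alarm; the same grouping could be done per site instead of per node.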