Issues from GDB
LCG
John Gordon, STFC
WLCG MB meeting, September 28th 2010
Topics at September GDB
• OPN Monitoring
• APEL
• CERNVMFS
• Experiments’ Operational Issues (Quarterly)
• Others
Monitoring
J. Shade/GDB LHCOPN Update, 08-SEP-2010
• Missing a central view of LHCOPN
• HADES data exists (at DFN?)
• Prototype dashboard
  a) Site status is up when the one-way delay (OWD) is within +/-15% of baseline and packet loss is less than 0.1% per five minutes.
  b) Site status is down when packet loss is 100% per five minutes.
  c) Site status is degraded when measurement values are between a) and b).
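The up/down/degraded rule for the prototype dashboard can be sketched as a small function. This is a minimal illustration, not the dashboard's actual code: the names `SiteStatus` and `classify_site` are hypothetical, the units (milliseconds, percent) are assumed, and treating a deviation of exactly 15% as "up" is an assumption the slide does not settle.

```python
from enum import Enum
from typing import Optional


class SiteStatus(Enum):
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"


def classify_site(owd_ms: Optional[float],
                  baseline_owd_ms: float,
                  packet_loss_pct: float) -> SiteStatus:
    """Classify one five-minute measurement window (hypothetical helper).

    owd_ms may be None when no packets arrived at all.
    """
    if baseline_owd_ms <= 0:
        raise ValueError("baseline OWD must be positive")

    # Down: every packet in the window was lost.
    if packet_loss_pct >= 100.0 or owd_ms is None:
        return SiteStatus.DOWN

    # Up: OWD within +/-15% of baseline AND packet loss below 0.1%.
    deviation = abs(owd_ms - baseline_owd_ms) / baseline_owd_ms
    if deviation <= 0.15 and packet_loss_pct < 0.1:
        return SiteStatus.UP

    # Degraded: anything between the two cases above.
    return SiteStatus.DEGRADED
```

For example, a window with an OWD 5% above baseline and no loss is classified as up, while the same delay with 0.5% loss is degraded.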
Prototype Dashboard
Monitoring
• DANTE baulked at the idea of developing their prototype further and supporting it.
• SARA and CERN have picked up the gauntlet.
• A historical view was requested and is foreseen.
• Questions were raised about problem-solving procedures.
APEL
• Update on latest status.
• Version using ActiveMQ message passing has been in production since June
  – New node type glite-apel replaces glite-MON.
  – Performant and reliable
• Sites encouraged to migrate
• Anticipate switching off central R-GMA registry at end of 2010.
• Requested WLCG input for EGI/EMI development plans
CERNVMFS for Software Servers
• The stress on shared software servers has been an issue for experiment and site operations over the summer
• PIC and RAL have tested CERNVMFS as a mechanism for distributing experiment software from CERN to worker nodes.
• CERNVMFS was developed in OpenLab and has been used to build virtual machine images on demand with experiment software
• It uses squid caches to bring software to a site on demand and also caches on the worker node (WN), relieving pressure on site servers.
• Removes the need to run jobs to install software at the site. Only caches versions used at that site. Removes duplicate files between and within releases.
• Initial feedback is encouraging. Tests will be scaled up to a full site in cooperation with the experiments: ATLAS for now, but others are interested.
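The caching and de-duplication behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not CERNVMFS code or its API: it assumes files are stored under a hash of their content, and the names `Tier`, `publish`, and `read` are hypothetical.

```python
import hashlib


def publish(files):
    """Build a catalog (path -> digest) and an object store (digest -> bytes).

    Identical content gets the same digest, so it is stored only once.
    """
    catalog, objects = {}, {}
    for path, data in files.items():
        digest = hashlib.sha1(data).hexdigest()
        catalog[path] = digest
        objects[digest] = data
    return catalog, objects


class Tier:
    """One cache level: worker node, site squid, or the origin server."""

    def __init__(self, name, parent=None, store=None):
        self.name = name
        self.parent = parent
        self.store = dict(store or {})
        self.requests = 0  # how many lookups reached this tier

    def get(self, digest):
        self.requests += 1
        if digest not in self.store:
            if self.parent is None:
                raise KeyError(digest)
            # Cache miss: fetch from the next tier up and keep a copy.
            self.store[digest] = self.parent.get(digest)
        return self.store[digest]


release_files = {
    "rel-1/lib/libfoo.so": b"foo-binary",
    "rel-2/lib/libfoo.so": b"foo-binary",  # unchanged between releases
    "rel-2/lib/libbar.so": b"bar-binary",
}
catalog, objects = publish(release_files)

origin = Tier("origin", store=objects)
squid = Tier("site-squid", parent=origin)
worker = Tier("worker-node", parent=squid)


def read(path):
    """Resolve a path through the catalog and fetch it on demand."""
    return worker.get(catalog[path])


read("rel-1/lib/libfoo.so")
read("rel-2/lib/libfoo.so")  # same content: served from the worker cache
```

In this model the two copies of `libfoo.so` occupy one object, the second read never leaves the worker node, and `libbar.so` is never transferred to the site because no job asked for it.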
Experiment Operations Feedback
• ALICE were happy
• ATLAS raised the issue of disk server reliability. What they measured was the number of incidents where a server was out for more than 24 hours. This is a combination of hardware/software reliability and the promptness of the site in restoring the service. There is scope for standardising responses across Tier1s.
  – Concerns about ASGC performance
• CMS interested in CernVMFS work for their Tier3s.
  – Discussion around information publishing (related to the L. Field proposal on a WLCG Information Officer)
Experiment Operations Feedback
• LHCb have problems with differing configurations at sites. They believe they can adapt their use if only they have enough information. One suggestion would be a Site Card (cf. the VO Card) that specifies enough information about the site to enable LHCb to automate optimisation of their use. Discussion in the meeting doubted whether this could be automated and suggested one-to-one discussion with the site as a better route.
gLite 3.1 Support
• Further work on retiring some gLite 3.1 services.
• gLite developers have proposed the end of life of some services. WLCG asked for comment.
  – https://twiki.cern.ch/twiki/bin/view/EGEE/LCGprioritiesgLite
• EGI Operations will plan with NGIs and their sites, taking WLCG views on board.
• Potential gap in EMI support filled. Specific sites have agreed to continue middleware support of batch systems required by WLCG. This covers support of CE Information Providers, blahd, and the APEL parser.
Misc.
• Gstat
  – Announced new WLCG gstat to be checked by sites.
  – Gave Ian’s timeline
• glexec
  – New Condor release over summer should address concerns of ATLAS. ATLAS and CMS asked to run tests again with the latest Condor.
October GDB
• Feedback from the DAaMonstrators
  – What can they show now?
  – What will they deliver for the end of the year?
  – Review by panel early in new year.
• Security Incident response
• gLite 3.1 retiral
• Installed capacity
• glexec testing