Issues from GDB

LCG: Issues from GDB. John Gordon, STFC. WLCG MB meeting, September 28th 2010


Page 1: Issues from GDB

LCG

Issues from GDB

John Gordon, STFC
WLCG MB meeting, September 28th 2010

Page 2: Issues from GDB

Topics at September GDB

• OPN Monitoring
• APEL
• CERNVMFS
• Experiments’ Operational Issues (Quarterly)
• Others

Page 3: Issues from GDB

Monitoring

• Missing a central view of LHCOPN
• HADES data exists (at DFN?)
• Prototype dashboard
• a) Site status is up when the one-way delay (OWD) is within +/-15% of baseline and packet loss is less than 0.1% per five minutes
• b) Site status is down when packet loss = 100% per five minutes
• Site status is degraded when measurement values are between a) and b).

J. Shade/GDB LHCOPN Update, 08-SEP-2010
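The three status rules above can be sketched as a small classifier. This is only an illustration of the thresholds quoted on the slide, not the dashboard's actual code; the function name, the parameter names and the choice of milliseconds and percent as units are assumptions.

```python
def classify_site_status(owd_ms, baseline_owd_ms, packet_loss_pct):
    """Classify one five-minute measurement window as 'up', 'down' or 'degraded'.

    owd_ms may be None when no packets arrived (hypothetical convention).
    """
    # Rule b): total packet loss over the window means the site is down.
    if packet_loss_pct >= 100.0:
        return "down"
    # Rule a): OWD within +/-15% of baseline and loss below 0.1% means up.
    if owd_ms is not None and baseline_owd_ms > 0:
        deviation = abs(owd_ms - baseline_owd_ms) / baseline_owd_ms
        if deviation <= 0.15 and packet_loss_pct < 0.1:
            return "up"
    # Anything between a) and b) is reported as degraded.
    return "degraded"
```

For example, a window with an OWD of 12 ms against a 10 ms baseline and no loss falls outside the 15% band and would be reported as degraded.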

Page 4: Issues from GDB

Prototype Dashboard

Page 5: Issues from GDB

Prototype Dashboard

Page 6: Issues from GDB

Monitoring

• DANTE baulked at the idea of developing their prototype further and supporting it
• SARA and CERN have picked up the gauntlet.
• An historical view was requested and is foreseen.
• Questions raised about problem solving procedures.

Page 7: Issues from GDB

APEL

• Update on latest status.
• Version using ActiveMQ message passing has been in production since June
  – New node type glite-apel replaces glite-MON.
  – Performant and reliable
• Sites encouraged to migrate
• Anticipate switching off central R-GMA registry at end of 2010.
• Requested WLCG input for EGI/EMI development plans

Page 8: Issues from GDB


CERNVMFS for Software Servers

• The stress on shared software servers has been an issue for experiment and site operations over the summer

• PIC and RAL have tested CERNVMFS as a mechanism for distributing experiment software from CERN to worker nodes.

• CERNVMFS was developed in OpenLab and has been used to build virtual machine images on demand with experiment software

• It uses squid caches to bring software to a site on demand and also caches on the worker nodes (WN), relieving pressure on site servers.

• Removes the need to run jobs to install software at site. Only caches versions used at that site. Removes duplicate files between and within releases.

• Initial feedback encouraging. Tests will be scaled up to full site in cooperation with experiments. ATLAS for now, but others are interested.
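One common way to remove duplicate files between and within releases is content-addressed storage, where each distinct file content is stored once under its hash. The sketch below illustrates that general idea only; it is not a description of CERNVMFS internals, and the function name, the data layout and the use of SHA-1 are assumptions.

```python
import hashlib


def build_catalog(releases):
    """Deduplicate file contents across software releases.

    releases: dict mapping release name -> dict of path -> file bytes
    Returns (catalog, store): catalog maps (release, path) to a content
    hash, and store holds each distinct content exactly once.
    """
    store = {}
    catalog = {}
    for release, files in releases.items():
        for path, data in files.items():
            digest = hashlib.sha1(data).hexdigest()
            # Identical bytes hash to the same key, so they are kept once.
            store.setdefault(digest, data)
            catalog[(release, path)] = digest
    return catalog, store
```

With two releases that share an unchanged library file, the catalog lists the file in both releases while the store keeps a single copy, so a cache only has to fetch it once.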


Page 9: Issues from GDB

Experiment Operations Feedback

• ALICE were happy
• ATLAS raised the issue of disk server reliability. What they measured was the number of incidents where a server was out for more than 24 hours. This is a combination of hardware/software reliability and promptness of the site in restoring the service. Scope for standardising responses across Tier1s.
  – Concerns about ASGC performance
• CMS interested in CernVMFS work for their Tier3s.
  – Discussion around information publishing (related to L Field proposal on WLCG Information Officer)
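The disk server metric ATLAS described, the number of incidents where a server was out for more than 24 hours, can be computed from a list of outage intervals. This is a minimal sketch under assumed inputs; the function name and the (start, end) tuple format are hypothetical and not taken from ATLAS tooling.

```python
from datetime import datetime, timedelta


def count_long_outages(outages, threshold=timedelta(hours=24)):
    """Count outages strictly longer than the threshold.

    outages: iterable of (start, end) datetime pairs, one per incident.
    """
    return sum(1 for start, end in outages if end - start > threshold)
```

Because the interval runs from failure to restoration, the count reflects both how often hardware or software fails and how quickly the site responds.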

Page 10: Issues from GDB


Experiment Operations Feedback

• LHCb have problems with differing configurations at sites. They believe they can adapt their use if only they have enough information. One suggestion would be a Site Card (cf. the VO Card) which specified enough information about the site to enable LHCb to automate optimisation of their use. Discussion in the meeting doubted whether this could be automated and suggested one-to-one discussion with the site as a better route.


Page 11: Issues from GDB

gLite 3.1 Support

• Further work on retiring some gLite 3.1 services.
• gLite developers have proposed the end of life of some services. WLCG asked for comment.
  – https://twiki.cern.ch/twiki/bin/view/EGEE/LCGprioritiesgLite
• EGI Operations will plan with NGIs and their sites taking WLCG views on board.
• Potential gap in EMI support filled. Specific sites have agreed to continue middleware support of batch systems required by WLCG. This covers support of CE Information Providers, blahd, and APEL parser.

Page 12: Issues from GDB

Misc.

• Gstat
  – Announced new WLCG gstat to be checked by sites.
  – Gave Ian’s timeline
• glexec
  – New Condor release over summer should address concerns of ATLAS. ATLAS and CMS asked to run tests again with latest Condor.

Page 13: Issues from GDB

October GDB

• Feedback from the DAaMonstrators
  – What can they show now?
  – What will they deliver for the end of the year?
  – Review by panel early in new year.
• Security Incident response
• gLite 3.1 retiral
• Installed capacity
• glexec testing