WLCG Service Report [email protected] ~~~ WLCG Management Board, 9th August 2011


Introduction

• 3 busy weeks since the last MB report on July 19th
• Good data taking with LHC record fills (passed the 2 fb-1 mark on August 5!)
• Three Service Incident Reports received:
  • IN2P3 outage of 13 DBs due to disk failures on July 19th–21st (SIR)
    • Affected ATLAS (COOL, LFC, AMI), CMS (FTS), LHCb (COOL, LFC) for >1 week
  • GGUS ALARM submission affected by KIT mail interface, July 22nd–26th (SIR)
  • Loss of 11k ATLAS files at KIT due to dirty GPFS, July 12th–26th (SIR)
• One more Service Incident Report is expected:
  • CERN KDC flood from ATLAS users in May-June (reported at last MB)
• 4 real GGUS ALARMS (3 for ATLAS and 1 for CMS)
  • All about storage – at CERN (Castor) and CNAF (Storm)
• Other notable issues reported at the daily meetings
  • Major power outage at FNAL due to thunderstorm on July 29
  • Storm issues at many ATLAS sites after 1.7.0 upgrade; workarounds applied
  • Low CPU efficiency of ALICE jobs finally solved (new hw, xrootd, svc config)
  • ADCR DB performance slow (after move to standby hw, but not correlated?)


https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_CCIN2P3_19july2011.pdf

https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20110727GGUS_Service_Incident_Report.pdf

https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/sir-kit-atlas-dcache-20110728.pdf

GGUS summary (3 weeks)

VO       User   Team   Alarm   Total
ALICE       4      0       1       5
ATLAS      15    105       7     127
CMS         4      0       2       6
LHCb        8     22       1      31
Totals     31    127      11     169
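As a quick consistency check on the summary table, the row and column totals can be recomputed from the per-VO counts. This is only a minimal sketch: the counts are copied from the table, while the names `TICKETS` and `totals` are illustrative and not part of any GGUS tooling.

```python
# Ticket counts per VO as (User, Team, Alarm), copied from the GGUS summary table.
TICKETS = {
    "ALICE": (4, 0, 1),
    "ATLAS": (15, 105, 7),
    "CMS": (4, 0, 2),
    "LHCb": (8, 22, 1),
}


def totals(tickets):
    """Return (total per VO, total per ticket type, grand total)."""
    per_vo = {vo: sum(counts) for vo, counts in tickets.items()}
    # zip(*...) transposes the rows so each tuple holds one ticket type.
    per_type = tuple(sum(column) for column in zip(*tickets.values()))
    return per_vo, per_type, sum(per_vo.values())


per_vo, per_type, grand_total = totals(TICKETS)
print(per_vo)
print(dict(zip(("User", "Team", "Alarm"), per_type)))
print(grand_total)
```

The recomputed values agree with the Total column and the Totals row as printed in the table.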


Support-related events since last MB

• There were 4 real ALARM tickets since the 2011/07/18 MB (3 weeks): 3 submitted by ATLAS and 1 by CMS, all ‘solved’ and ‘verified’; 2 of them were for CERN CASTOR and 2 for CNAF Storm.

• Ongoing GGUS problems in ALARM submission and/or escalation:

  • Problems between June 12-27 already reported at last MB, due to new KIT exim mailer and supposedly solved during week of June 27
  • For ATLAS ticket on July 24, GGUS did not allow ALARM submission and also failed to notify operators on TEAM-to-ALARM escalation. For CMS ALARM submitted on July 26, piquet was not called. These issues were solved last week at KIT (see SIR) and validated with test alarms.
  • This weekend again an ALARM submitted by ATLAS with INFN on August 6 did not reach the SMS system of the site. This had already been reported on July 17 (GGUS:72717). CNAF reported this morning that a fix has been applied and validated (tests have confirmed that ALARMS correctly trigger SMS messages).


https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20110727GGUS_Service_Incident_Report.pdf

https://ggus.eu/ws/ticket_info.php?ticket=72717

ATLAS ALARM->CERN CASTORATLAS DOWN GGUS:72890


What time UTC What happened

2011/07/24 03:16 SUNDAY

GGUS TEAM ticket (as GGUS did not allow direct ALARM submission!), automatic email notification to [email protected] AND automatic assignment to ROC_CERN.

2011/07/24 03:17

Submitter immediately escalates ticket to ALARM. Email notification recorded as ‘Sent to [email protected]’ (but no email received by operators & service mgrs?). Automatic SNOW ticket creation successful.

2011/07/24 06:34

Supporter records that data export from CERN is also affected

2011/07/24 06:43-07:57

Supporter calls 75011. Operator had received no alarm! Supporter emails [email protected] and later also [email protected] and [email protected].

2011/07/24 08:03

Castor developer confirms TEAM-to-ALARM did not work and observes that no problem can be seen at this time.

2011/07/24 08:20-08:44

Supporter confirms problem was real. ATLAS data export still suffering due to backlog accumulated when CASTOR down.

2011/07/26 10:16

Castor mgr puts ticket on hold, discussion ongoing with ATLAS

2011/07/29 16:35-20:56

Castor expert sets ticket ‘solved’, applying workarounds and hotfixes. Submitter sets ticket ‘verified’.


https://gus.fzk.de/ws/ticket_info.php?ticket=72890

https://gus.fzk.de/ws/ticket_info.php?ticket=72944

CMS ALARM->CERN CASTOR XROOTD REDIRECTOR NOT WORKING GGUS:72944



2011/07/26 08:56

GGUS ALARM ticket, automatic notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.

2011/07/26 09:56

Castor admin restarts redirector and asks if all ok. “Redirector threads were busy with CASTOR (stuck in synchronous Puts), so new requests were stuck (and would get eventually run into Kerberos Clock skew detection). The number of threads can be increased, but this might point to some overload issue. We might also have hit some issue with locking on the Kerberos replay cache, a core dump was taken and is being looked at.”

2011/07/26 09:58

Castor admin adds “For the record, ALARM seems not to have reached CERN via the usual channels (i.e. no parallel routing to CERN operator or SMS alert list, hence no piquet call)”.

2011/07/26 10:15-21:12

Submitter replies and Castor admin sets ticket ‘solved’ and later ‘verified’.


ATLAS ALARM->INFN SRM DOWN GGUS:73054



2011/07/29 15:13

GGUS TEAM ticket, automatic email notification to [email protected] AND automatic assignment to NGI_IT.

2011/07/29 16:02

Transfers from T0 are also failing. Supporter escalates ticket to ALARM. Notification sent to address [email protected].

2011/07/29 16:02

Automatic reply “You are not allowed to trigger an SMS alarm for INFN Tier1. Anyway your message has been forwarded to the operations mailing list.”

2011/07/29 16:45

Site admin restarts GPFS process in Storm BE, asks if ok now.

2011/07/29 17:21

Supporter confirms all is ok, ticket can be closed.

2011/07/31 03:23

Shifter reopens ticket because SRM is down again.

2011/07/31 05:04

Supporter sets ticket as closed and moves new SRM issue to new TEAM ticket GGUS:73068 (to be escalated if not solved promptly – but the issue is fixed at 05:53).

2011/07/31 17:22

Supporter sets ticket as ‘verified’.


ATLAS ALARM->INFN PUT GRIDFTP_COPY_WAIT: CONNECTION TIMED OUT GGUS:73236



2011/08/06 14:30 SATURDAY

GGUS TEAM ticket, automatic email notification to [email protected] AND automatic assignment to NGI_IT.

2011/08/06 17:43

SRM seems to be down. Supporter escalates ticket to ALARM. Notification sent to address [email protected].

2011/08/06 17:43

Automatic reply “You are not allowed to trigger an SMS alarm for INFN Tier1. Anyway your message has been forwarded to the operations mailing list.”

2011/08/06 19:53

Site admin resets Storm BE via power cycle, asks if ok now. Problem with SMS will be investigated during the week.

2011/08/06 22:30

Supporter confirms all is ok, sets ticket as closed and verified.
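To compare the four ALARM timelines, one can compute the delay between the moment each ticket became an ALARM and the first action recorded from the site. The sketch below uses only timestamps (UTC) copied from the timelines; which entry counts as the "first site action" is an interpretation made here (Castor developer reply for GGUS:72890, Castor admin restart for GGUS:72944, site admin action for GGUS:73054 and GGUS:73236), and the names `ALARMS` and `response_delays` are illustrative.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"

# (ticket, time raised as ALARM, first recorded site action), all UTC.
# Timestamps come from the timelines; the choice of "first site action"
# is an assumption of this sketch, not a GGUS definition.
ALARMS = [
    ("GGUS:72890", "2011/07/24 03:17", "2011/07/24 08:03"),
    ("GGUS:72944", "2011/07/26 08:56", "2011/07/26 09:56"),
    ("GGUS:73054", "2011/07/29 16:02", "2011/07/29 16:45"),
    ("GGUS:73236", "2011/08/06 17:43", "2011/08/06 19:53"),
]


def response_delays(alarms):
    """Map each ticket to the delay in whole minutes before the first site action."""
    delays = {}
    for ticket, raised, acted in alarms:
        delta = datetime.strptime(acted, FMT) - datetime.strptime(raised, FMT)
        delays[ticket] = int(delta.total_seconds() // 60)
    return delays


for ticket, minutes in response_delays(ALARMS).items():
    print(f"{ticket}: {minutes // 60}h{minutes % 60:02d}m")
```

Under these assumptions the longest delay belongs to GGUS:72890, the ticket for which the TEAM-to-ALARM escalation never reached the operators.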


Analysis of the availability plots: Week of 18/07/2011

[Availability plots; annotations 2.1 (ATLAS) and 4.1 (LHCb) are explained below]

• ATLAS
  • 2.1 IN2P3-CC - UNSCHEDULED - problem with disk on the Oracle cluster - DB service was unstable
• LHCb
  • 4.1 LCG.IN2P3.fr - UNSCHEDULED - problem with disk on the Oracle cluster - DB service was unstable

[Availability plots; annotations 2.1 (ATLAS) and 3.1 (CMS) are explained below]

• ATLAS
  • 2.1 Taiwan-LCG2 - SCHEDULED - Network Maintenance
• CMS
  • 3.1 T1_TW_ASGC - SCHEDULED - Network Maintenance and Phedex agent upgrade


• All sites were operating above 50% threshold during the entire week. Nothing to report.

Conclusions

• Business as usual – successful record data taking
• Serious issue with databases at IN2P3 affecting ATLAS, CMS, LHCb
• Experienced many GGUS problems with ALARM submission and escalation (operators and piquet not always contacted)
