Katy Ellis
28th August 2019
GridPP43
CMS usage of UK resources
Introduction
• Katy Ellis
• New(ish) CMS / RAL Tier 1 Liaison
• Started in September 2018
• New to CMS (PhD on ATLAS, 2012)
• Included in my role – operations, improving job efficiency/failure rate, “projects”:
  • DOMA TPC, Rucio integration for CMS with CTA, XRootD investigations.
Contents
• CMS experiment upgrades in LS2
• Review of CMS UK computing resources
• News from Tier 2s
• Additional CMS resources
• Progress with Rucio for CMS
LHC schedule
CMS current status
• Preparing for Run 3
  • Hardware upgrades
  • Computing upgrades
• Preparing for the longer-term future
  • Long Shutdown 3 activities are already being planned in detail.
• HL-LHC civil-engineering work has been ongoing since June 2018 – five new buildings on the surface, as well as modifications to the underground cavern and galleries.
Long shutdown 2 CMS upgrades
• Installation of new beampipe
• Replacement pixel detector (innermost layer)
• Upgraded power system for the magnet
• Installation of new multi-GEM chambers for increased coverage of muon detection
CMS UK Computing Resources

Site                        CPU Pledge (HS06)  CPU Provision (HS06)  Disk Pledge  Disk Provision
RAL Tier 1                  52,000             61,408                5.44 PB*     5.44 PB*
Brunel (London)                                                                   1.49 PB
Imperial College (London)                      44,198                             4.50 PB
RALPP (Tier 2, South)                          24,122                             3.73 PB
Bristol (South)                                                                   727 TB
QMUL                        -                  14,934                N/A          2 TB
RHUL                        -                  Opportunistic         -            0
Oxford                      -                  Opportunistic         -            41 TB
Glasgow                     -                  Opportunistic         -            174 TB
DODAS                       N/A                Opportunistic         -            N/A
CMS@home                    N/A                Volunteered           -            N/A

* Includes 200 TB tape buffer
Source: REBUS

Total T2 CPU pledge: 50 kHS06 (London 32,641 HS06 + South 17,326 HS06)
T2 disk pledge: London 2.925 PB; South 975 TB
+ 17.6 PB tape pledge at T1
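A quick sanity check on the quoted totals – a minimal sketch in Python, using only the pledge numbers above:

```python
# Sanity check on the quoted T2 pledge totals (numbers from the table above).
london_cpu, south_cpu = 32_641, 17_326        # HS06
london_disk_pb, south_disk_tb = 2.925, 975    # PB and TB respectively

total_cpu = london_cpu + south_cpu                      # 49,967 HS06, i.e. ~50 kHS06
total_disk_pb = london_disk_pb + south_disk_tb / 1000   # ~3.9 PB

print(f"T2 CPU pledge:  {total_cpu:,} HS06 (~50 kHS06)")
print(f"T2 disk pledge: {total_disk_pb:.1f} PB")
```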
CPU in last 6 months
[Plots: total completed jobs and running cores over the last 6 months, from EGI Accounting, broken down into T1, T2s and opportunistic. CMS pledge lines are marked at 52 kHS06 (T1) and 48 kHS06; total T2 pledge = 50 kHS06. Caption: “The Katy Effect?”]
Imperial College news
• IC have moved their data centre from Kensington to Slough!
• Coordinated by Simon Fayer
• The move took ~1 week
• The data centre is now run remotely, but tended by local system administrators – no noticeable difference
RALPP news
• In the last month, RALPP has joined the LHCONE network
  • LHC Open Network Environment
  • Improves data access by flattening the T1/T2/T3 hierarchy so that any site may connect with any other.
• One issue connecting with the FNAL FTS, but it was quickly resolved.
• This is a precursor to RAL T1 joining LHCONE – other T1s are already connected.
Incorporating additional sites
• Analysis jobs submitted to RAL T1/T2, IC or Brunel will also match Glasgow, Oxford, QMUL and RHUL (see the configuration sketch below).
• UK sites now form a mesh structure
  • “CMS Tier 3” sites read/write data from/to RALPP
• CMS are keen to extend this to other non-CMS sites.
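To make the site matching concrete, here is a minimal CRAB3 configuration sketch. All task-specific names (request name, pset, dataset) are hypothetical placeholders, and the explicit whitelist is purely for illustration – in practice the global pool may match these sites automatically; the CMS site names themselves are the standard ones.

```python
# Minimal CRAB3 config sketch (illustrative, not the official UK-mesh setup).
from CRABClient.UserUtilities import config

config = config()
config.General.requestName = 'uk_mesh_test'            # hypothetical task name
config.JobType.pluginName = 'Analysis'
config.JobType.psetName = 'pset.py'                    # hypothetical CMSSW config
config.Data.inputDataset = '/SomeDataset/Run2018A/NANOAOD'  # hypothetical dataset
config.Data.splitting = 'FileBased'
config.Data.unitsPerJob = 10

# Jobs may run at the UK CMS T2s *and* the associated "Tier 3" sites;
# the whitelist below illustrates the extended UK site list.
config.Site.whitelist = ['T1_UK_RAL', 'T2_UK_London_IC', 'T2_UK_London_Brunel',
                         'T2_UK_SGrid_RALPP', 'T3_UK_ScotGrid_GLA',
                         'T3_UK_SGrid_Oxford', 'T3_UK_London_QMUL',
                         'T3_UK_London_RHUL']
config.Site.storageSite = 'T2_UK_SGrid_RALPP'          # Tier 3 output goes to RALPP
```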
DODAS – “Dynamic On Demand Analysis Service”
• Start a personal Tier 2 (3?) on the cloud.
• Running in ~20 minutes on OpenStack, Azure, AWS, EGI clouds.
• Developed in Italy; adapted and tested by Riccardo di Maria at IC.
  • Looking for a new person.
  • Tested on a temporary cloud of 800 nodes at IC.
• “Useful if you have a deadline and a credit card”.
CMS@home *
• Volunteers run CMS jobs on their personal computers
• They can view MONIT plots with CERN or affiliate credentials, or via Facebook/Google/etc.
• Almost entirely single-core MC production jobs
• Planning to make submission more automated
• Trying to move to SLC7
* Talk to Ivan Reid if you want to know more
Move to Rucio
• Rucio will replace PhEDEx from Run 3
  • File transfer service
  • File catalogue
  • Highly scalable
  • Heterogeneous storage systems worldwide
  • Run centrally
• Used successfully by ATLAS for several years
• Now open to the wider community
• CMS activities include: integration with Production and User Analysis job submission, setup of databases (e.g. user accounts), data-synchronization performance testing, setup of monitoring, etc. (see the client sketch below)
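As a flavour of what “file transfer service + file catalogue” means in practice, here is a minimal sketch using the Rucio Python client to request a replica of a dataset at a site. The scope, dataset name and target RSE are hypothetical placeholders, and the snippet assumes an already-configured Rucio client environment.

```python
# Minimal sketch of requesting data placement with the Rucio client.
# Scope, dataset name and RSE below are hypothetical placeholders,
# and a configured rucio.cfg + valid credentials are assumed.
from rucio.client import Client

client = Client()

# Ask Rucio to maintain one replica of the dataset at the named RSE;
# Rucio itself chooses sources and drives the underlying FTS transfers.
rule_ids = client.add_replication_rule(
    dids=[{'scope': 'cms', 'name': '/store/example/NanoAODv5/dataset'}],
    copies=1,
    rse_expression='T2_UK_London_IC',
)
print(rule_ids)  # rule identifier(s), usable later to track the transfer
```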
Rucio for CMS – current status
• Rucio components are automated in Kubernetes.
  • Monitoring will be added in Kibana.
• ‘Million file test’ progressing well – being repeated.
  • Monitoring in early stages…
• Some level of synching on many sites - NanoAOD (spot-check sketch below).
  • Subscriptions.
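A hedged sketch of how one might spot-check that a synchronized NanoAOD file is visible through Rucio and where its replicas live; the DID below is a hypothetical placeholder.

```python
# Sketch: check that a synchronized file is known to Rucio and list its
# replicas. The scope/name are hypothetical placeholders.
from rucio.client import Client

client = Client()
dids = [{'scope': 'cms', 'name': '/store/example/nanoaod/file.root'}]
for rep in client.list_replicas(dids=dids):
    # 'rses' maps each RSE holding a replica to its physical file URLs
    for rse, pfns in rep['rses'].items():
        print(rse, pfns)
```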
Rucio for CMS – Tape
• Able to transfer into all Tier 1 tape systems
  • RAL was the most difficult – different Rucio config
• Have now fixed the RAL config, and are able to use the tape as a source
• Cannot yet use other T1 tapes as sources
  • Hoping the fix for RAL will point towards a solution
• Able to transfer into and out of the CERN Tape Archive (CTA)
  • Must be done via EOS
• Started more substantial tests, ~10 TB
• Waiting for monitoring from CTA (2.155 TB were written in 3 hours)
• Working towards a ~200 TB transfer test (see the back-of-the-envelope estimate below)
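For scale, the quoted 2.155 TB in 3 hours corresponds to roughly 200 MB/s sustained. The sketch below works out what that rate would imply for the ~200 TB test if it held unchanged, which is a deliberately naive assumption (a real test would add parallelism).

```python
# Back-of-the-envelope: sustained rate from the 2.155 TB / 3 h data point,
# and naive extrapolation to a 200 TB test at the same rate.
written_tb = 2.155
hours = 3.0

rate_tb_per_h = written_tb / hours                    # ~0.72 TB/h
rate_mb_per_s = written_tb * 1e6 / (hours * 3600)     # ~200 MB/s

target_tb = 200.0
days_at_same_rate = target_tb / rate_tb_per_h / 24    # ~11.6 days

print(f"rate: {rate_tb_per_h:.2f} TB/h ({rate_mb_per_s:.0f} MB/s)")
print(f"200 TB at this rate: ~{days_at_same_rate:.1f} days")
```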
Summary
• CMS is preparing for Run 3 and beyond.
• UK sites are meeting their pledges, and often exceeding them.
• CMS Tier 2s are working well.
• Thanks to other sites for offering spare capacity.
• Tier 1 continues to make improvements.
• Rucio integration and testing for CMS is in full swing.
Backup
VO shares for UK sites
Increase in data rate
Long shutdown activities
• Detector upgrades
• Production
• Tape and disk cleaning
• CERN Tape Archive
• Rucio
CMS detector upgrades for Run 3
Further upgrades for HL-LHC in Runs 4 and 5 are already at the detailed planning stage via TDRs.
Rucio and CERN Tape Archive (CTA)
• CTA: CERN Tape Archive, which replaces CASTOR at CERN this summer
  • Meta-data migration only
  • Change to the high-level structure? Possible issue with Rucio
• RAL will also be changing tape system in the medium term
  • Tender is out
• Useful to gain expertise integrating Rucio with tape systems
• Pre-production service on CTA, with a test Rucio Storage Element (RSE) – see the sketch below
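A minimal sketch of how a test tape RSE might be declared through the Rucio client. The RSE name is a hypothetical placeholder, and a real setup would also need protocols, FTS settings and quotas configured; this shows only the declaration step.

```python
# Sketch: declare a test tape RSE and flag it as tape via an RSE attribute.
# The RSE name is a hypothetical placeholder; a real RSE also needs
# protocols, FTS configuration and space limits before it can be used.
from rucio.client import Client

client = Client()
client.add_rse('T1_UK_RAL_TapeTest', deterministic=True)
client.add_rse_attribute('T1_UK_RAL_TapeTest', 'istape', True)
```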
Details on CMS T2s (pledge)
• Imperial (2200 TB)
  • Moving the data centre to Slough in June
  • 2 × 100 Gb/s network links (one is fallback)
• RAL_PP (1100 TB, 1600 TB imminently)
  • Connecting to LHCONE soon
• Brunel (500 TB)
  • CMS using close to 100% of storage due to a bug
  • Some issues after upgrading to the DOME version of DPM; better testing before deployment would improve this situation.
  • 40 Gb/s coming in May
‘Old’ CERN CASTOR tape setup
‘New’ CERN CTA tape setup
Why is it so complicated?