Transcript of "Computing readiness for Belle II" - Benedikt Hegner and Paul Laycock, for the BNL team (31 slides)

Page 1

Computing readiness for Belle II

Benedikt Hegner, Paul Laycock, for the BNL team

Page 2

Overview

• BNL is the US Tier-1 computing facility for Belle II

• Raw data centre: 100% 2018-2020, 30% thereafter

• Overview of CPU, storage and network provisioned

• BNL responsibilities in software & computing

• Conditions database service

• Distributed data management

• Plans for the future

Page 3

BNL Tier-1 resources

• CPU (USA pledge is 41.53 kHS06 in 2019)

• 27 nodes, 192 GB of RAM and 16 TB (4x4 TB) local disks

• 1944 total job slots ~ 28 kHS06

• Adding 21 additional nodes = 1512 additional job slots

• Total of 49 nodes = 3528 job slots ~ 49 kHS06 by April 2019

• Disk Storage (USA pledge is 1.68 PB in 2019)

• 1.5 PB of total storage

• Adding 2 PB of additional storage

• 1 PB of old storage will be retired

• 2.5 PB of total storage by January 2019

• Tape Storage (USA pledge is 5.14 PB in 2019)

• 1.2 PB of total storage, adding 1.8 PB soon

• Can borrow and share resources for tape within the RACF, which allows for optimising purchases

• Network

• Two external 100 Gbps links, adding a third 100 Gbps link in 2019

Page 4

BNL Belle II computing & software responsibilities


Conditions database (CDB)

Distributed data management (DDM)

Page 5

BNL Belle II computing & software people

Core team (% FY2019):

• Conditions database (CDB): Benedikt Hegner (100%), Ruslan Mashinistov (50%)

• Distributed data management (DDM): Paul Laycock (100%), Sergey Padolski (50%), Ruslan Mashinistov (50%)

Support: Carlos Gamboa (45%), Hironori Ito (45%), John De Stefano (10%) [Alex Undrus]

Project: Maxim Potekhin [Mikhail Borodin - no longer on Belle II]

Core team completed with arrival of Hegner and Laycock in September 2018

Page 6

PNNL to BNL service migration - CDB

• CDB transition completed as planned, thanks to thorough planning and excellent communication between PNNL and BNL colleagues - thanks to all involved

Page 7

CDB service deployment

• Transition well advertised and coordinated - handled by an updated basf2 release

• A handful of users needed to be pointed at the announcements

• General code documentation provided

• No documentation of the PNNL kubernetes infrastructure

• New deployment model developed at BNL (C. Gamboa), shown right [diagram: Belle2db Metadata Service]

• With this scalable deployment model, all tests so far show the system can comfortably handle expected Belle II workloads - thanks to all involved also in testing

Page 8

CDB status and plan

• Several code updates and upgrades were needed, as could be expected

• squid version 3.5 now used (previous version no longer supported)

• Continually improving monitoring infrastructure to minimise effort required to address operations issues

• Swagger user interface uses secure http by default and is password protected

• retired redundant web UI

• b2conditionsdb client tools can authenticate to b2s: reader, writer and coordinator roles supported

• Strong authentication and authorisation being discussed with collaboration

• B. Hegner now co-coordinating database group, will help ensure we understand requirements

• Workflows and use cases rather poorly defined up until now, acknowledged by database group

• Investigating use cases for a global tag browser (browse GT structure, check IOV integrity, etc.); a sketch of an IOV-integrity check follows this list

• Investigated using a simple cache with DESY

• Consulting with DB group and DESY on cost of full service replication
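To make the IOV-integrity use case concrete, here is a minimal sketch, assuming IOVs are (first_exp, first_run, last_exp, last_run) tuples with -1 meaning open-ended; the function names are illustrative and not part of the actual CDB tooling.

```python
# Minimal sketch of an IOV-integrity check for a global tag: report overlaps
# and gaps in coverage. IOVs are assumed to be (first_exp, first_run,
# last_exp, last_run) tuples, with -1 in the "last" fields meaning open-ended.
def successor(exp, run):
    """Naive successor of (exp, run); real tooling would consult the run catalogue."""
    return (exp, run + 1)

def check_iov_integrity(iovs):
    """Return a list of human-readable problems found in a list of IOVs."""
    problems = []
    ordered = sorted(iovs, key=lambda iov: (iov[0], iov[1]))  # sort by start
    for prev, curr in zip(ordered, ordered[1:]):
        prev_end = (prev[2], prev[3])
        curr_start = (curr[0], curr[1])
        if prev_end == (-1, -1):
            problems.append(f"{prev} is open-ended but {curr} starts after it")
        elif curr_start <= prev_end:
            problems.append(f"{prev} and {curr} overlap")
        elif curr_start != successor(*prev_end):
            problems.append(f"gap between {prev} and {curr}")
    return problems

# Example: reports a gap between run 100 and run 150 of experiment 7
# check_iov_integrity([(7, 0, 7, 100), (7, 150, 7, -1)])
```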

Page 9

PNNL to BNL service migration - DDM

• Distributed data management is a key component of BelleDIRAC

• Designed by PNNL according to the DIRAC architecture, heavily used by the Fabrication system to transfer production outputs

• Key use case: gather the outputs belonging to one data block onto a preferred Tier-1 site for merging (a toy sketch of this follows)
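As a toy illustration of that use case (not the BelleDIRAC implementation; the names and structures are hypothetical), one way to pick the merge destination is the Tier-1 that already holds the most files of the block, then replicate only the missing files there:

```python
# Toy sketch of the "gather one data block at a preferred Tier-1 for merging"
# use case: pick the Tier-1 that already holds most of the block's files and
# request replication of the remainder there. Names are illustrative only.
from collections import Counter

def choose_merge_site(replica_map, tier1_sites):
    """replica_map: {lfn: set(sites holding it)}. Return the preferred Tier-1."""
    counts = Counter()
    for sites in replica_map.values():
        for site in sites & set(tier1_sites):
            counts[site] += 1
    # Fall back to the first Tier-1 if no file is at a Tier-1 yet.
    return counts.most_common(1)[0][0] if counts else tier1_sites[0]

def plan_gather(replica_map, tier1_sites):
    """Return (destination, [lfns that still need to be copied there])."""
    destination = choose_merge_site(replica_map, tier1_sites)
    missing = [lfn for lfn, sites in replica_map.items() if destination not in sites]
    return destination, missing

# Example: a block with 3 files, two of them already at BNL:
# plan_gather({"f1": {"BNL"}, "f2": {"BNL", "KEK"}, "f3": {"KEK"}}, ["BNL", "KEK"])
# -> ("BNL", ["f3"])
```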

Page 10

DDM status summary

• Good collaboration between PNNL and the distributed computing team led to a well understood design

• Commissioning work was needed before putting some PNNL code into production

• Basic components (replication, deletion) functional, room for improvement in implementation to meet phase 3 requirements

• In close collaboration with the distributed computing team, basic functionality improved by S. Padolski

• e.g. SE-wise parallelisation of algorithms (see the sketch after this list)

• Long list of outstanding development requirements for a full-fledged state-of-the-art DDM for phase 3

• Evaluate community standard, Rucio
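A minimal sketch of what SE-wise parallelisation means in practice, assuming pending transfer requests are simple (lfn, destination SE) pairs; submit_transfer() is a stand-in for whatever actually moves the file, not the BelleDIRAC call:

```python
# Minimal sketch of SE-wise parallelisation: group pending transfer requests by
# destination storage element (SE) and process each SE's queue concurrently,
# so one slow or failing SE does not block the others.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def submit_transfer(lfn, destination_se):
    """Placeholder for the real transfer submission; returns True on success."""
    print(f"transferring {lfn} -> {destination_se}")
    return True

def process_se_queue(destination_se, lfns):
    """Work through one SE's queue sequentially, counting successes."""
    return sum(1 for lfn in lfns if submit_transfer(lfn, destination_se))

def run_se_parallel(requests):
    """requests: list of (lfn, destination_se) pairs. Returns successes per SE."""
    queues = defaultdict(list)
    for lfn, se in requests:
        queues[se].append(lfn)
    with ThreadPoolExecutor(max_workers=max(1, len(queues))) as pool:
        futures = {se: pool.submit(process_se_queue, se, lfns)
                   for se, lfns in queues.items()}
    return {se: fut.result() for se, fut in futures.items()}
```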

Page 11

DDM status

Task | Priority | Complexity | Status | Comment
Functional tests | High | Easy | Complete |
Deletion subsystem | High | Medium | Complete |
Replica Policy | High | High | Incomplete | Ruslan working
Monitoring | Medium | Medium | Complete | Basic functionality only
SE plugin support | High | Medium | Complete |
SE-parallel agents | High | Medium | Complete | Became high priority in 2018
RMS queue management | High | Very high | Complete | Became high priority in 2018

High priority items from the January 2018 review: https://kds.kek.jp/indico/event/26522/session/5/contribution/148

Page 12

DDM development plan

Task | Complexity | Status
Data integrity | High | New
Automatic deletion of tmp disk | Medium | New
Dark file detection | High | New
Dataset containers | High | New
Data lifetime | Very high | New

Distributed computing development plan in Confluence: https://confluence.desy.de/display/BI/DDM+Development+Plans

• Functionality required by a full-fledged state-of-the-art DDM

• Nothing currently implemented

• Evaluate community standard, Rucio

Page 13

DDM requirements

• Global network upgrade allowing 100 Gbps connections worldwide

• Exploited by the Belle II computing model to automatically distribute data to sites

• Conservatively estimate that in the longer term the rate could spike up to ~100 Hz (see below)

• The computing model needs an appropriate DDM that can deal with this - the current DDM operates at ~1 Hz

• Assume at peak we saturate 4 point-to-point 100 Gbps connections

• 50 GB/s at 2 GB/file means a 25 Hz copy rate, which implies 25 Hz of deletions; one log file per data file doubles this, giving ~100 Hz in total (a back-of-envelope check follows)
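A back-of-envelope check of those numbers (illustrative arithmetic only):

```python
# Back-of-envelope check of the peak DDM operation rate quoted above.
links = 4                                    # saturated point-to-point 100 Gbps links
throughput_gbs = links * 100 / 8             # 400 Gbps -> 50 GB/s
file_size_gb = 2
copy_rate_hz = throughput_gbs / file_size_gb # 25 Hz of file copies
delete_rate_hz = copy_rate_hz                # matching deletions
log_file_factor = 2                          # one log file accompanies each data file
total_rate_hz = (copy_rate_hz + delete_rate_hz) * log_file_factor
print(total_rate_hz)                         # -> 100.0 operations per second
```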

Page 14

Rucio evaluation

• Rucio has the advanced functionality and scalability that Belle II DDM needs

• https://rucio.cern.ch/

• Key features missing from the current DDM include

• Authentication, accounts, quotas

• Data discovery, monitoring, analytics

• Data lifetime

• Crucial for managing data is being able to delete it; Rucio also tracks data popularity to identify write-once-read-never data (a client sketch illustrating rules and lifetimes follows this list)

• Data integrity, dark data, dataset containers

• Stable Rucio operation for ATLAS beyond the scale needed for Belle II
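To make the rule and lifetime concepts concrete, a minimal sketch using the standard Rucio Python client; the scope, dataset, file and RSE names are placeholders, and authentication is assumed to come from the local Rucio configuration:

```python
# Minimal sketch using the standard Rucio client: create a dataset, attach an
# (already registered) file, and pin one replica at an RSE with a finite
# lifetime, after which Rucio may clean it up. Scope/names/RSE are placeholders.
from rucio.client import Client

client = Client()  # credentials and endpoint come from the local rucio.cfg

scope, dataset = "user.example", "example.dataset"   # placeholder names
client.add_dataset(scope=scope, name=dataset)
client.attach_dids(scope=scope, name=dataset,
                   dids=[{"scope": scope, "name": "example.file.root"}])

# One replica at a (placeholder) Tier-1 RSE, kept for 30 days.
client.add_replication_rule(dids=[{"scope": scope, "name": dataset}],
                            copies=1,
                            rse_expression="EXAMPLE-T1_DATADISK",
                            lifetime=30 * 24 * 3600)
```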

Page 15

Rucio support

• Developed by ATLAS, adopted by CMS and DUNE, and a growing community

• Rucio is open source:

• https://github.com/rucio/rucio

• Weekly dev meetings to set priorities

• Use Slack for fast feedback with the core development team, with the pool of expertise growing all the time

• Currently very dependent on core CERN team, working to ensure their availability will not be an issue in the future (participate in the second Rucio community workshop)

• Experience of other experiments moving to Rucio was persuasive

Page 16

DDM service

• Decided to evaluate Rucio while sustaining service for the current DDM (CDDM)

• Productive visits of the Rucio project lead (M. Barisits), the Belle II distributed computing lead (Ueda-san) and the fabrication system lead (Miyake-san) in November

• No blockers identified, development plan agreed

• A Rucio-based DDM hides everything behind the API: no effect on clients, transparent migration possible, easy to switch (see the facade sketch at the end of this slide)

• Teething problems encountered that have slowed down development

• First with client installation, then with client-server incompatibility

• Will continue to operate CDDM, investigating putting replication-policy subsystem into production (ongoing)

• CDDM requires a lot of manual work by DC lead

• Continue to work closely with distributed computing team to establish the path of least pain for all concerned

• RAW data replication use case prioritised; a Rucio-based solution could potentially be brought online earlier

• Fabrication system use case tightly coupled with current system
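A minimal sketch of the transparent-migration idea, assuming a thin DDM facade; the class and method names are illustrative, not the actual BelleDIRAC interfaces:

```python
# Minimal sketch of hiding the DDM implementation behind one interface so the
# backend (current DDM vs Rucio) can be switched without touching clients.
# Class and method names are illustrative, not the actual BelleDIRAC API.
from abc import ABC, abstractmethod

class DDMBackend(ABC):
    @abstractmethod
    def replicate(self, lfn: str, destination_se: str) -> None: ...
    @abstractmethod
    def delete(self, lfn: str, se: str) -> None: ...

class CurrentDDMBackend(DDMBackend):
    def replicate(self, lfn, destination_se):
        print(f"CDDM: replicate {lfn} -> {destination_se}")
    def delete(self, lfn, se):
        print(f"CDDM: delete {lfn} at {se}")

class RucioBackend(DDMBackend):
    def replicate(self, lfn, destination_se):
        print(f"Rucio: add replication rule for {lfn} at {destination_se}")
    def delete(self, lfn, se):
        print(f"Rucio: delete {lfn} at {se} (or let its rule lifetime expire)")

class DDMService:
    """What clients (e.g. the fabrication system) call; the backend is swappable."""
    def __init__(self, backend: DDMBackend):
        self._backend = backend
    def replicate(self, lfn, destination_se):
        self._backend.replicate(lfn, destination_se)
    def delete(self, lfn, se):
        self._backend.delete(lfn, se)

# Switching backends is a one-line change, invisible to callers:
ddm = DDMService(RucioBackend())   # or DDMService(CurrentDDMBackend())
ddm.replicate("/belle/data/raw/file.root", "EXAMPLE-T1_DATADISK")
```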

Page 17

Original plan

Timeline (October to March), two parallel tracks: Improve CDDM and Commission Rucio. Milestones: Feature Freeze; Development Freeze; Test Rucio in production (limited functionality); Rucio used in production with core functionality; Contingency; Ready for Phase 3.

• Requires de-scoping development of CDDM in favour of stability for Phase 3

Page 18

DDM plan

Timeline (October to March): Commission Replication Policy subsystem; Improve CDDM; Rucio development environment and setup; Development Freeze; Commission Rucio; Test Rucio in production (limited functionality); Rucio used in production with core functionality; Ready for Phase 3.

• Contingency gone, very tight to commission for March, assumes the evaluation of Rucio is successful

• Assure continued operation of CDDM for production

Page 19

Review Recommendations - Management and Budget

1. Software professionals need to be identified and brought on board as soon as possible for CDB and DDMS support and development.

Done, see personnel table earlier.

2. Further possibilities to fit within the EOP envelope, beyond the single proposal to drop a crucial software effort, should be examined. Development of a resource-constrained plan with impacts will allow future flexibility when needed.

Done.

3. The draft MOU for US contributions to computing with Belle II/KEK should be available by March 31, 2018.

Done.

4. Establish monthly status updates for US Belle II Computer Operations, preferably in conjunction with the already-existing Detector Ops monthly status update.

Done.

Page 20

Review Recommendations - Risk

5. The risk register should be updated regularly to retire or modify risks that were uniquely associated with PNNL.

Done, updated quarterly.

6. The granularity of the risk register should be tied more firmly to a WBS-like list of tasks, e.g. “networking” risks are more factorable than what is presented in the current risk register.

Done, see backup for latest version.

7. There needs to be further delineation between risks and opportunities.

In progress.

8. There needs to be a firmer basis of estimate, or further details as to how probabilities and impacts were quantified.

Done.

9. The current risk register includes both detector and computer operations, with a change control mechanism to draw on management reserves appropriate for PNNL. This mechanism may be less suitable for BNL. If so, an appropriate change control mechanism for computing risks should be implemented.

Done, it is appropriate.

Page 21

Review Recommendations - Computing Model

10. Continue development of plans for universities to participate in analyses of the data challenges leading to colliding beams data. US Belle II, in conjunction with international Belle II, should incorporate detailed design, prototyping and scaling tests of a distributed analysis model to ensure that a sufficiently scalable solution is available prior to data taking.

Data challenge effort led by J. Bennett, scaling and prototyping tests of distributed analysis have begun.

11. The computing team should continue to consider mirroring the key services to minimize potential impact of networks and/or external site instability. The mirroring plans currently concentrate on failover while neglecting disaster recovery.

Network and site stability is very well managed at BNL (three 100 Gbps connections in 2019 and an onsite ESnet engineer). We are consulting relevant groups, experts and stakeholders in Belle II as well as other labs (DESY) regarding disaster recovery (requires full service replication).

Page 22

Partners


• Increasing collaboration and coordination with Mississippi, Jake Bennett and Michel Hernandez

• Raw data registration/replication tools (integration with DDM)

• Monitoring tools for distributed computing and GRID computing display for outreach

• Contributions to grid-based analysis tools development and support

• Working with Belle II distributed computing team members, particularly KEK (I. Ueda and H. Miyake) on distributed data management

• Working with CERN Rucio core team, particularly M. Barisits

• Working with Belle II distributed computing team and UBC (Racha Cheaib) to improve skim production and distributed data analysis

• Working with LMU (Martin Ritter) and Ljubljana (Marko Bracko) to improve understanding of calibration workflows and client tools for the CDB

Page 23

Summary: current plans

• Core team of computing and software experts now complete and fulfilling BNL responsibilities

• Tier-1 and raw data centre, as well as CDB and DDM infrastructure, provided

• CDB development well advanced

• Basic authentication/authorization in place, working on enforcing strong authentication

• Work on understanding workflows and use cases started

• DDM development provides some cause for concern

• Support for current DDM with basic functionality guaranteed for phase 3, adding replication by policy if possible, minimising the need for manual intervention

• Rucio DDM deployment may be later than the start of phase 3, depending on development requests for CDDM

• DDM migration designed to allow a transparent switch to Rucio, working closely with the distributed computing team and fabrication system experts

Page 24

Future plans

• Beyond delivering baseline commitments, we wish to strengthen the quality of service and help facilitate analysis

• B. Hegner, in his role as database group co-coordinator, will ensure calibration workflows are fully supported and the CDB service is a silent component of distributed computing and analysis

• P. Laycock as HSF Data Analysis working group co-convenor will ensure Belle II is well represented

• Work together with colleagues in Data Production group, particularly Mississippi (J. Bennett and M. Hernandez) and UBC (R. Cheaib) to strengthen links with distributed computing to improve distributed analysis

• Work with Belle II experts and other international labs to ensure optimal quality of service for reasonable cost

• Thank you for your attention!

Page 25

Backup

Page 26

USA pledges 2019 and beyond

Year | 2019 | 2020 | 2021 | 2022
Tape (PB) | 5.14 | 12.10 | 15.25 | 19.15
Disk (PB) | 1.68 | 3.00 | 3.47 | 5.87
CPU (kHS06) | 41.53 | 60.09 | 70.69 | 92.31

Page 27

Risk register matrix

Page 28

Computing risk register (1)

Page 29

Computing risk register (2)

Page 30

Computing risk register (3)

Page 31

CDB testing

• Gatling configured to make 400 requests/second for 900 seconds, looping through a list of 21.2k exp#/run# records (an illustrative load-generation sketch follows below)

• Service dealt with this workload and several other tests
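A minimal sketch of replaying exp/run lookups at a fixed rate (not the actual Gatling configuration); the endpoint URL and query parameters are placeholders, not the real CDB REST API:

```python
# Minimal sketch of a fixed-rate load generator for payload lookups against a
# conditions-database REST service. The base URL and query parameters below
# are placeholders, not the actual Belle II CDB endpoint.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "https://conditions.example.org/lookup"   # hypothetical endpoint
RATE_HZ = 400            # requests per second, as in the Gatling test
DURATION_S = 900         # test length in seconds

def fetch(exp, run):
    """Issue one payload-lookup request and return the HTTP status code."""
    url = f"{BASE_URL}?exp={exp}&run={run}"           # placeholder query params
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.status

def load_test(records):
    """Loop over (exp, run) records at roughly RATE_HZ for DURATION_S seconds."""
    deadline = time.monotonic() + DURATION_S
    with ThreadPoolExecutor(max_workers=64) as pool:
        i = 0
        while time.monotonic() < deadline:
            exp, run = records[i % len(records)]
            pool.submit(fetch, exp, run)
            i += 1
            time.sleep(1.0 / RATE_HZ)                 # crude fixed-rate pacing

# Example: load_test([(7, 1000), (7, 1001), ...]) with the 21.2k exp/run list
```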
