WLCG Networks: Update on Monitoring and Analytics

Marian Babik 2,∗, Shawn McKee 1, Pedro Andrade 2, Brian Paul Bockelman 6, Robert Gardner 4, Edgar Mauricio Fajardo Hernandez 5, Edoardo Martelli 2, Ilija Vukotic 4, Derek Weitzel 3, and Marian Zvada 3

for the WLCG Network Throughput Working Group

1 Physics Department, University of Michigan, Ann Arbor, MI, USA
2 European Organisation for Nuclear Research (CERN), Geneva, Switzerland
3 University of Nebraska – Lincoln, Lincoln, NE, USA
4 Enrico Fermi Institute, University of Chicago, Chicago, IL, USA
5 University of California San Diego, La Jolla, CA, USA
6 Morgridge Institute of Research, Madison, WI, USA

Abstract. WLCG relies on the network as a critical part of its infrastructure and therefore needs to guarantee effective network usage and prompt detection and resolution of any network issues, including connection failures, congestion and traffic routing. The OSG Networking Area, in partnership with WLCG, is focused on being the primary source of networking information for its partners and constituents. It was established to ensure sites and experiments can better understand and fix networking issues, while providing an analytics platform that aggregates network monitoring data with higher level workload and data transfer services. This has been facilitated by the global network of perfSONAR instances that have been commissioned and are operated in collaboration with the WLCG Network Throughput Working Group. An additional important update is the inclusion of the newly funded NSF project SAND (Service Analysis and Network Diagnosis), which is focusing on network analytics. This paper describes the current state of the network measurement and analytics platform and summarises the activities taken by the working group and our collaborators. This includes the progress being made in providing higher level analytics, alerting and alarming from the rich set of network metrics we are gathering.

1 Introduction

The Open Science Grid (OSG) and the Worldwide LHC Computing Grid (WLCG) have been supporting network monitoring activities since 2012, focusing on assisting their users and affiliates in improving their overall network throughput by introducing active monitoring of their networks and providing the ability to test for and identify potential network performance bottlenecks [1, 2]. Two important areas of development were establishing and operating a global network of measurement agents, and developing and operating a comprehensive network monitoring platform, which collects and stores the measurements while making them available for further processing. This has been complemented by several activities that have improved our ability to manage and use both network topology and network metrics to extract a clearer understanding of our network problems, their locations and bottlenecks via analytics [3].

∗ We gratefully acknowledge the National Science Foundation which supported this work through NSF grants #1148698, #1836650 and #1827116. In addition, we acknowledge our collaborations with the WLCG and LHCONE/LHCOPN communities who also participated in this effort.

arXiv:2007.00598v1 [cs.NI] 1 Jul 2020

Figure 1: perfSONAR public network; there are currently around 2000 known deployed instances, with likely an equal number of private deployments. The WLCG network is one of the biggest private deployments, with over 250 instances connected to LHCOPN/LHCONE.

The WLCG Network Throughput Working Group was established in 2014 to help with some of the underlying tasks, such as overseeing the global network of measurement agents based on perfSONAR [4], establishing baseline measurements and performing low-level debugging activities. This has led to a dedicated network throughput support unit, which has proven able to coordinate and resolve complex network performance incidents within LHCOPN and LHCONE [5].

2 Network Performance

Networks that connect sites and experiments need to handle ever-increasing amounts of data and convey it across multiple networks around the world. Due to the underlying complexity, end-to-end performance depends on a number of components and their operational status anywhere within the network. When a network is under-performing or errors occur, it can become very difficult to identify and correct the source of the problem: local testing will often not find the cause, because errors can occur anywhere along the path as data moves between multiple networks. While disconnect failures are relatively easy to detect and fix, soft failures, where a network continues to function but has compromised performance, can be very hard to detect. Identification of such problems is best served by active end-to-end measurements against a predefined target, which in the scope of WLCG and OSG means a global network of agents testing all possible network paths end to end.
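
The distinction between disconnect failures and soft failures can be illustrated with a minimal Python sketch that classifies a path from a series of active measurements. The function name, the thresholds and the sample values are illustrative assumptions, not values used by the working group.

```python
from statistics import median


def classify_path(loss_fractions, throughputs_mbps, baseline_mbps,
                  loss_threshold=0.01, degraded_ratio=0.5):
    """Classify a path as 'down', 'soft-failure' or 'ok'.

    loss_fractions: packet loss per probe, each between 0.0 and 1.0.
    throughputs_mbps: measured throughput per test.
    baseline_mbps: expected throughput for this path.
    Both thresholds are hypothetical defaults chosen for illustration.
    """
    if not loss_fractions or not throughputs_mbps:
        raise ValueError("need at least one measurement of each kind")
    # Total packet loss on every probe means the path is disconnected.
    if min(loss_fractions) >= 1.0:
        return "down"
    typical_loss = median(loss_fractions)
    typical_rate = median(throughputs_mbps)
    # The path still carries traffic, but it persistently loses packets or
    # runs well below its established baseline: a soft failure.
    if (typical_loss > loss_threshold
            or typical_rate < degraded_ratio * baseline_mbps):
        return "soft-failure"
    return "ok"
```

Using the median rather than a single sample keeps one noisy measurement from flagging a path; a production system would likely need per-path baselines rather than fixed thresholds.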

3 OSG/WLCG Network Monitoring Platform

Such a global network of agents has been established in collaboration with WLCG and OSG sites based on perfSONAR, a network measurement toolkit designed to provide federated coverage of paths that helps to establish end-to-end usage expectations (see Fig. 1). perfSONAR is open source software, developed by a consortium of ESnet, Internet2, Indiana University, University of Michigan and GEANT [4]. It provides a number of tools that can take different network measurements covering different aspects of network functions, bundled in a comprehensive package including tools, a scheduler, visualisation and centralised management functions such as configuration and discovery.

Figure 2: OSG Network Monitoring Platform - distributed deployment that collects, stores, visualises and provides APIs for the measurements collected by the WLCG perfSONAR infrastructure

The toolkit supports a range of standard metrics that provide useful insights into the current state of the network. For latency and loss, apart from ping, it offers an implementation of the one-way and two-way active measurement protocols (OWAMP/TWAMP) [6]. An important metric for end-to-end network performance is throughput, which can be measured by three different tools: iperf3, iperf2 and nuttcp. The most common is iperf3, which can perform memory-to-memory tests over UDP or TCP and reports TCP retransmits and the size of the congestion window, both of which are very useful in troubleshooting. The final network characteristic is the network path, which can be measured by traceroute or tracepath. The latter is preferred because it performs path MTU discovery: it can determine the maximum transmission unit (MTU) along the path, which serves as an important indicator of MTU issues, which have become quite common.
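
As an illustration of how the retransmit and congestion window figures might be pulled out of a throughput test, the Python sketch below summarises a report in the style of iperf3's JSON output. The field names (`intervals`, `streams`, `snd_cwnd`, `end`, `sum_sent`, `sum_received`) follow our understanding of that output and should be checked against the iperf3 version in use; the sample report and its values are invented for the example.

```python
import json

# Invented sample, trimmed to the fields used below.
SAMPLE = """
{"intervals": [
   {"streams": [{"snd_cwnd": 1200000, "retransmits": 0}]},
   {"streams": [{"snd_cwnd": 300000, "retransmits": 42}]}
 ],
 "end": {"sum_sent": {"bits_per_second": 4.1e9, "retransmits": 42},
         "sum_received": {"bits_per_second": 4.0e9}}}
"""


def summarise_iperf3(report_text):
    """Extract throughput, retransmits and smallest congestion window."""
    report = json.loads(report_text)
    end = report["end"]
    # Collect the sender congestion window from every stream and interval.
    cwnds = [
        stream["snd_cwnd"]
        for interval in report.get("intervals", [])
        for stream in interval["streams"]
        if "snd_cwnd" in stream
    ]
    return {
        "received_gbps": end["sum_received"]["bits_per_second"] / 1e9,
        # Retransmits are only reported for TCP tests, hence .get().
        "retransmits": end["sum_sent"].get("retransmits"),
        "min_cwnd_bytes": min(cwnds) if cwnds else None,
    }


summary = summarise_iperf3(SAMPLE)
```

A congestion window that collapses in the same interval where retransmits appear, as in the invented sample, is the kind of pattern that makes these two fields useful in troubleshooting.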

OSG has developed and deployed a comprehensive network monitoring platform [7] that collects, stores, visualises and further processes all the measurements taken by the perfSONAR infrastructure (see Fig. 2). At its core is a collector, which regularly connects to the remote perfSONAR toolkits, downloads all recent measurements and publishes them to a message bus based on RabbitMQ. This stream is then used to feed three different types of stores: a short-term store located at the University of Chicago, which stores data for the last 6 months; a long-term store located at the University of Nebraska, which stores the entire dataset; and a tape system at FNAL, which is used as a persistent backup. The measurement stream is also available to the experiments via an ActiveMQ bus at CERN, which is populated by a dedicated bridge connected directly to RabbitMQ. The platform is also integrated with the ATLAS Analytics and Machine Learning Platform [8], which makes it easy to combine and analyze network measurements with metrics from various other sources (including Panda, FTS, Rucio, etc.).
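
The fan-out from one measurement stream to several stores with different retention can be sketched as follows. This is an in-memory stand-in written for illustration only: the `Bus` and `Store` classes are hypothetical, no RabbitMQ client is involved, and "6 months" is approximated as 183 days.

```python
from datetime import datetime, timedelta, timezone

SHORT_TERM_RETENTION = timedelta(days=183)  # "last 6 months", approximated


class Store:
    """A consumer of the measurement stream with optional retention."""

    def __init__(self, name, retention=None):
        self.name = name
        self.retention = retention  # None means keep everything
        self.records = []

    def consume(self, record):
        self.records.append(record)

    def expire(self, now):
        """Drop records older than the retention window, if one is set."""
        if self.retention is None:
            return
        cutoff = now - self.retention
        self.records = [r for r in self.records if r["timestamp"] >= cutoff]


class Bus:
    """Stand-in for the message bus: every subscriber sees every record."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, store):
        self.subscribers.append(store)

    def publish(self, record):
        for store in self.subscribers:
            store.consume(record)


def build_platform():
    """Wire one bus to a short-term store, a long-term store and a backup."""
    bus = Bus()
    short_term = Store("short-term", SHORT_TERM_RETENTION)
    long_term = Store("long-term")
    backup = Store("tape-backup")
    for store in (short_term, long_term, backup):
        bus.subscribe(store)
    return bus, short_term, long_term, backup
```

Publishing once and letting each store apply its own retention keeps the collector unaware of how many consumers exist, which is the main reason to place a message bus between the two.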

The platform also contains a centralised configuration system [9] built upon PWA [10], which is used to configure the test specifications (tools/measurement specs), meshes (collections of hosts participating in the tests) as well as the test schedule for the entire infrastructure. There is also infrastructure monitoring [11, 12] that oversees the status of the platform and measurement infrastructure, and a set of MaDDash dashboards that visualize the measurement results [13]. In addition, there are a number of further dashboards and visualisations, which are discussed in Section 5.
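
To make the notion of a mesh concrete, the short Python sketch below expands a list of hosts and test types into one task per ordered host pair. The dictionary layout and host names are hypothetical and do not reproduce the configuration format used by the actual configuration system.

```python
from itertools import permutations


def full_mesh_tasks(hosts, test_types):
    """Return one task per ordered (source, destination) pair and test type.

    Ordered pairs are used because a path can behave differently in each
    direction; a host is never tested against itself.
    """
    unique_hosts = sorted(set(hosts))
    return [
        {"source": src, "destination": dst, "test": test}
        for test in test_types
        for src, dst in permutations(unique_hosts, 2)
    ]
```

For n hosts and t test types this yields n × (n − 1) × t tasks, so a three-host mesh with latency and throughput tests produces 12 tasks. The quadratic growth is one reason a central system for meshes and schedules is helpful.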

3.1 Job Network Measurements

In addition to the metrics collected by perfSONAR, the OSG also collects network metrics from submit hosts within the OSG. These submit hosts measure the network conditions between the worker nodes and submit hosts during file transfers. File transfers generally only occur when the job starts and when it completes. Therefore, the measurements do not capture the status of the connection during job execution.

These job network measurements can capture aspects of the end-to-end path that might be untested by perfSONAR. For example, in the OSG, worker nodes can be behind a firewall or a NAT device and, in such cases, perfSONAR would often be connected at the network edge and would not be measuring the same network path.

HTCondor is configured to output TCP statistics for data transfer connections between the submit host and the worker node. The statistics include the number of lost packets, bytes transferred and TCP reordering events. These statistics are written to a log by HTCondor, which is parsed and uploaded by Filebeat [14] to the same datastore we use for perfSONAR metrics. The data components are parsed and annotated, e.g., we augment transfer records with GeoIP information.
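
The parse-and-annotate step can be sketched in Python as below. The `key=value` record layout, the field names and the prefix table are all hypothetical: the real HTCondor log format differs, and a real deployment would query a GeoIP database rather than a hard-coded table (the addresses used here are reserved documentation ranges).

```python
import ipaddress

# Hypothetical prefix-to-location table standing in for a GeoIP database.
GEO_TABLE = {
    ipaddress.ip_network("192.0.2.0/24"): {"country": "US", "site": "ExampleSiteA"},
    ipaddress.ip_network("198.51.100.0/24"): {"country": "CH", "site": "ExampleSiteB"},
}


def parse_transfer_line(line):
    """Parse a hypothetical 'key=value ...' transfer record into a dict."""
    fields = dict(item.split("=", 1) for item in line.split())
    # Convert the numeric TCP statistics from strings to integers.
    for key in ("bytes", "lost_packets", "reordering_events"):
        fields[key] = int(fields[key])
    return fields


def annotate(record):
    """Return a copy of the record with a 'geo' entry (None if unknown)."""
    addr = ipaddress.ip_address(record["worker_ip"])
    for network, location in GEO_TABLE.items():
        if addr in network:
            return {**record, "geo": location}
    return {**record, "geo": None}
```

Annotating at ingestion time, rather than at query time, means a map such as the one in Figure 3 can be drawn directly from the stored records.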

We are just beginning to collect and analyze the job network measurements. Figure 3 shows the data transfer volume to job destinations within the U.S. for January 2020.

Figure 3: Map of destination by bytes transferred to jobs.

4 Platform Use

The platform and measurement infrastructure have been used in a number of activities and collaborations, improving our understanding of the networks and contributing to their technical evolution and design.

Establishing end-site network throughput support has helped to resolve a number of challenging cases that would otherwise be very difficult to detect and isolate, or would take a considerable amount of time to resolve [15]. In addition, the unit has helped sites with their data centre network design, consulting on the potential bottlenecks caused by network equipment with insufficient buffers as well as helping to test and benchmark their performance. The feedback gathered from the support unit on the different cases has led to a discussion and a concrete proposal for MTU recommendations for LHCOPN/LHCONE [16], which aims to improve the overall throughput and standardise MTU deployment across R&Es and sites.
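
One reason MTU size matters for throughput is per-packet header overhead. The following Python sketch is a back-of-the-envelope calculation rather than part of the cited proposal; it assumes IPv4 and TCP headers without options (40 bytes) and standard Ethernet framing overhead (preamble, header, frame check sequence and inter-frame gap, 38 bytes in total).

```python
def tcp_payload_efficiency(mtu, ip_tcp_headers=40, ethernet_overhead=38):
    """Fraction of bytes on the wire that carry TCP payload.

    mtu: IP maximum transmission unit in bytes.
    ip_tcp_headers: IPv4 + TCP headers without options (assumption).
    ethernet_overhead: preamble 8 + header 14 + FCS 4 + inter-frame gap 12.
    """
    if mtu <= ip_tcp_headers:
        raise ValueError("MTU must exceed the IP and TCP header size")
    return (mtu - ip_tcp_headers) / (mtu + ethernet_overhead)


standard = tcp_payload_efficiency(1500)  # roughly 0.949
jumbo = tcp_payload_efficiency(9000)     # roughly 0.991
```

Under these assumptions, jumbo frames recover about four percentage points of wire capacity; they also reduce the number of packets that hosts and network equipment must process for the same volume of data.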

There have been a number of significant contributions to the development and design of network performance monitoring over the years. A notable example is the current configuration system, which was initially developed as an internal OSG tool and was later adopted by the perfSONAR consortium. Another area of close collaboration was the deployment and testing of IPv6 readiness, which was led by the HEPiX IPv6 working group [17]. This was a particular example of how the platform can be useful in the future to evaluate the potential deployment of new technologies (such as new TCP congestion control algorithms, software defined networks, etc.). Another such example is a collaboration with the HELIX NEBULA Science Cloud project, which used the platform to assess the network performance of cloud providers. Finally, close collaborations were established with other research domains and institutes that have also shown interest in network performance and in deploying a platform similar to the one deployed for OSG/WLCG.

Figure 4: Visualisation of multiple network paths as measured by the traceroute tool between Purdue and Fermi National Accelerator Laboratory.

5 Network Analytics

Establishing the OSG Network Monitoring Platform and making the data available for experiments and network researchers has triggered great interest from different communities that have started to look at the existing measurements and perform analyses with various goals. At the same time, the platform has made it possible to diagnose and debug existing network issues, identify the problematic links or equipment and help fix the underlying problems. Among the several past and present projects, the following have delivered notable results or identified important areas where further research is needed:

• Real-time detection of “obvious” issues, with corresponding alerting and notifications, has been developed at the University of Chicago and is currently being tested as part of the ATLAS Analytics and Machine Learning Platform [8].

• A study to derive how LHCOPN network paths perform from the existing OWAMP measurements has shown that OWAMP is sufficiently sensitive to pinpoint when network equipment gets stressed and could be used to easily detect peak periods. The main challenge that still remains is how to extend the model to LHCONE, mainly due to the lack of reliable network traffic data that could be used to train the neural network [18].

• A new visualisation platform for network paths was developed in collaboration with MEPhI¹, which allows existing paths between two endpoints to be selected and visualised (see Fig. 4).

• A network path analysis project is currently ongoing at the University of Michigan and aims to calculate simple statistics from the existing path measurements in order to auto-detect potential routing problems and help with the visualisation of the measurements.

• In collaboration with the SAND project [19, 20] and some of the other activities mentioned in this section, we are developing a range of dashboards [21] using Kibana to provide distinct insights into the perfSONAR metrics hosted in Elasticsearch.

• Understanding the differences between network utilization as seen by R&E networks and as computed from the experiments' data transfers is another area of interest. While there has been significant effort to understand network utilisation from the bulk data transfers, there are still major gaps in getting reliable sources of information directly from the R&E networks.
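
The kind of simple path statistics mentioned in the list above can be sketched in Python as follows. This is an illustrative guess at what such statistics might look like, not the method used by the project; the function names, the hop labels and the 0.8 threshold are assumptions.

```python
from collections import Counter


def path_statistics(traces):
    """Summarise traceroute results for one source-destination pair.

    traces: list of hop sequences (tuples or lists), in time order.
    """
    if not traces:
        raise ValueError("no traces supplied")
    paths = [tuple(trace) for trace in traces]
    counts = Counter(paths)
    dominant, dominant_count = counts.most_common(1)[0]
    # Count how often consecutive measurements saw different paths.
    changes = sum(1 for a, b in zip(paths, paths[1:]) if a != b)
    return {
        "distinct_paths": len(counts),
        "dominant_path": dominant,
        "dominant_share": dominant_count / len(paths),
        "route_changes": changes,
    }


def looks_unstable(stats, min_share=0.8):
    """Flag a pair whose most common path covers too few measurements."""
    return stats["dominant_share"] < min_share
```

For example, five traces in which one measurement detours through a different router give two distinct paths, two route changes and a dominant share of 0.8. Whether that counts as a routing problem depends on the threshold chosen, which is why such statistics are best treated as candidates for inspection rather than as alarms.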

Further analytical studies are planned to better understand our use of networks and how it could be improved. New versions of perfSONAR plan to integrate direct publishing of the results and configurations needed to operate it globally, which would help us make progress in a number of areas requiring access to real-time data, as well as providing automated debugging and optimisations.

6 Evolution and Future

In summary, OSG in collaboration with WLCG has established a comprehensive network monitoring platform that has been used in a number of activities, ranging from operations, support and technological deployments to research and development for network analytics. We have made progress in several areas of network monitoring and plan to continue to evolve in the same areas in the near term. There are a number of areas where significant R&D effort will be needed to progress on some of the previously mentioned challenges, but there are also a number of opportunities that could provide funding and effort to continue the work. Two projects that will lead operations and development in HEP network monitoring are the NSF-funded IRIS-HEP and SAND. IRIS-HEP will fund the LHC part of the Open Science Grid, including the networking area, and will create a new integration path (the Scalable Systems Laboratory) to deliver its R&D activities into the distributed and scientific production infrastructures. Service Analysis and Network Diagnosis (SAND) will focus on combining, visualising, and analyzing disparate network monitoring and service logging data. It will extend and augment the OSG networking efforts with a primary goal of extracting useful insights and metrics from the wealth of network data being gathered from perfSONAR, FTS, R&E network flows and related network information from HTCondor and others.

7 Acknowledgements

We gratefully acknowledge the National Science Foundation which supported this work through NSF grants OAC-1836650 and OAC-1827116. In addition, we acknowledge our collaborations with the CERN IT, WLCG and LHCONE/LHCOPN communities who also participated in this effort.

¹ https://eng.mephi.ru/

References

[1] R. Pordes, D. Petravick, B. Kramer, D. Olson, M. Livny, A. Roy, P. Avery, K. Blackburn, T. Wenaus, F. Würthwein et al. (2007), Vol. 78, p. 012057, http://stacks.iop.org/1742-6596/78/i=1/a=012057

[2] I. Bird, P. Buncic, F. Carminati, M. Cattaneo, P. Clarke, I. Fisk, M. Girone, J. Harvey, B. Kersevan, P. Mato et al., Tech. Rep. CERN-LHCC-2014-014. LCG-TDR-002 (2014), http://cds.cern.ch/record/1695401

[3] S. McKee, M. Babik, S. Campana, A.D. Girolamo, T. Wildish, J. Closier, S. Roiser, C. Grigoras, I. Vukotic, M. Salichos et al., Integrating network and transfer metrics to optimize transfer efficiency and experiment workflows (2015), Vol. 664, p. 052003, http://stacks.iop.org/1742-6596/664/i=5/a=052003

[4] A. Hanemann, J.W. Boote, E.L. Boyd, J. Durand, L. Kudarimoti, R. Łapacz, D.M. Swany, S. Trocha, J. Zurawski, PerfSONAR: A Service Oriented Architecture for Multi-domain Network Monitoring, in Service-Oriented Computing - ICSOC 2005, edited by B. Benatallah, F. Casati, P. Traverso (Springer Berlin Heidelberg, Berlin, Heidelberg, 2005), pp. 241–254, ISBN 978-3-540-32294-8

[5] E. Martelli, S. Stancu, LHCOPN and LHCONE: Status and Future Evolution (2015), Vol. 664, p. 052025, http://stacks.iop.org/1742-6596/664/i=5/a=052025

[6] S. Shalunov, B. Teitelbaum, A. Karp, J. Boote, M. Zekauskas, RFC 4656, RFC Editor (2006)

[7] R. Quick, M. Babik, E.M. Fajardo, K. Gross, S. Hayashi, M. Krenz, T. Lee, S. McKee, C. Pipes, S. Teige, Journal of Physics: Conference Series 898, 082044 (2017)

[8] I. Vukotic, D. Barberis, F. Legger, R. Gardner (ATLAS Collaboration), ATLAS Analytics and Machine Learning Platforms (2018)

[9] S. McKee, M. Babik, (June 2020), Osg/wlcg psconfig server, Retrieved from https://psconfig.opensciencegrid.org/

[10] perfSONAR Developers, (June, 2020), pSConfig Web Admin, Retrieved from https://docs.perfsonar.net/pwa.html

[11] M. Babik, (June 2020), Experiments Test Framework (ETF), Retrieved from https://etf.cern.ch/docs

[12] M. Babik, (2019), perfSONAR ETF Monitoring, Retrieved from http://etf.cern.ch/docs/latest/user/overview.html#service

[13] perfSONAR Consortium, (June 2020), perfSONAR Monitoring and Debugging Dashboard (MADDASH), Retrieved from http://psmad.opensciencegrid.org/maddash-webui/index.cgi

[14] E. Inc, (June 2020), Filebeat: Lightweight log analysis & elasticsearch, Retrieved from https://www.elastic.co/beats/filebeat

[15] M. Babik, (June 2020), Network throughput support unit documentation, Retrieved from https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Network_Throughput_Support_Unit

[16] S. McKee, M. O’Connor, (June 2018), Jumbo frame considerations for LHCOPN/LHCONE, Retrieved from https://indico.cern.ch/event/764495/

[17] S. Campana, K. Chadwick, G. Chen, J. Chudoba, P. Clarke, M. Eliáš, A. Elwell, S. Fayer, T. Finnern, L. Goossens et al., WLCG and IPv6 – the HEPiX IPv6 working group (2014), Vol. 513, p. 062026, http://stacks.iop.org/1742-6596/513/i=6/a=062026

[18] M. Babik, H. Borras, Tech. Rep. CERN-IT-Note-2017-001, CERN, Geneva (2017), http://cds.cern.ch/record/2252410

[19] B. Bockelman, (2018), Cc* integration: Service analysis and network diagnosis (sand), Retrieved from https://www.nsf.gov/awardsearch/showAward?AWD_ID=1827116

[20] D. Weitzel, (June, 2020), Website for service analysis and network diagnosis project, Retrieved from https://sand-ci.org/

[21] S. McKee, (2020), Prototype OSG/WLCG kibana network dashboards, online, Retrieved from https://atlas-kibana.mwt2.org/s/networking/app/kibana#/dashboard/07a03a80-beda-11e9-96c8-d543436ab024?_g=()