Advanced Monitoring Techniques for the ATLAS TDAQ Network

HEPiX – 9 May 2008 – Network monitoring in ATLAS

Matei Ciobotaru
CERN; University of California, Irvine; “Politehnica” University of Bucharest

on behalf of the ATLAS Networking Group: B. Martin, A. Al-Shabibi, S. Batraneanu, S. Stancu, L. Leahu, L. Darlea, M. Ivanovici




The ATLAS TDAQ Network – Role

The ATLAS Trigger and Data Acquisition Network (TDAQ) handles the data transfers from the ATLAS detector to the analysis and storage nodes

Built with Gigabit Ethernet switches and routers

Sustained rates of 150 Gbit/s

The experiment relies on the network to function 24/7 with a minimal number of failures

[Figure labels: ATLAS detector; TDAQ system]


The ATLAS TDAQ Network – Photos

[Photo captions:]

2500 computers installed in 90 racks

2 concentrator switches per rack

5 “big” chassis-based devices at the core

Almost 3000 devices and 5000 network connections…

How to make sure everything is working correctly?


Inside this talk

Requirements in terms of network management

Commercial software we are using

Tools we developed in-house

Services for users, integration with ATLAS

Plans for the future

The big picture


ATLAS Requirements

Installation
– Ease the equipment registration, inventory and verification
– Configure the devices

Operation
– Check the state of health of devices and links
– Monitor traffic conditions, raise alarms when needed
– Assist the user in navigating the realm of information
– Integration with the ATLAS TDAQ software

Diagnostics
– Provide aids to the admin in case something goes wrong
– Be able to suggest solutions to problems

[Slide label: Complexity]

Manage a large local area network which has to be very reliable and which has very high throughput requirements


Equipment registration

ATLAS equipment needs to be registered in four databases

Only some databases support batch registrations; the others require manual intervention, which may lead to inconsistencies

Developed a web application to cope with this situation

– Central place for querying all the information about a device

– Ability to cross-check the data across all databases to detect incomplete or incorrect registrations
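The cross-check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual web application: the database names (`db_a` … `db_d`) and device names are hypothetical, and each database is reduced to the set of device names it holds.

```python
def find_incomplete_registrations(databases):
    """Return {device: [databases where it is missing]} for every device
    that is not registered in all databases.

    `databases` maps a database name to the set of device names it holds.
    """
    all_devices = set().union(*databases.values())
    report = {}
    for device in sorted(all_devices):
        missing = sorted(name for name, devices in databases.items()
                         if device not in devices)
        if missing:
            report[device] = missing
    return report


# Hypothetical contents of four registration databases.
example = {
    "db_a": {"sw-rack-01", "sw-rack-02", "pc-001"},
    "db_b": {"sw-rack-01", "pc-001"},
    "db_c": {"sw-rack-01", "sw-rack-02", "pc-001"},
    "db_d": {"sw-rack-01", "sw-rack-02"},
}

for device, missing in find_incomplete_registrations(example).items():
    print(device, "is missing from", ", ".join(missing))
```

A device registered everywhere does not appear in the report, so an empty report means the databases agree on which devices exist.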


Equipment inventory

Network diagrams for ATLAS are made in Microsoft Visio using the NetDesign package

We created tools which discover what really exists in the network (what is connected where)

Developed an application which compares the two data sources (Visio and auto-discovery); mismatches are detected and, if necessary, corrected in the field

For the network documentation, we also automatically generate a printable “report” of all the connectivity

[Diagram labels: Visio; Network Discovery]
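The comparison between the documented and the discovered topology amounts to a set difference over links. The sketch below assumes, purely for illustration, that both sources have already been exported as lists of links, each link being a pair of (device, port) endpoints; the device and port names are hypothetical and the real tools may represent the data differently.

```python
def normalize(link):
    """Order the two endpoints so that A-B and B-A compare as equal."""
    return tuple(sorted(link))


def compare_topologies(documented, discovered):
    """Return (missing_in_field, undocumented) as sets of normalized links."""
    doc = {normalize(link) for link in documented}
    found = {normalize(link) for link in discovered}
    return doc - found, found - doc


# Hypothetical data: one link matches, one is cabled to a different port.
documented = [
    (("sw-core-1", "1/4"), ("sw-rack-07", "49")),
    (("sw-core-1", "1/5"), ("sw-rack-08", "49")),
]
discovered = [
    (("sw-rack-07", "49"), ("sw-core-1", "1/4")),
    (("sw-core-1", "1/6"), ("sw-rack-08", "49")),
]

missing, undocumented = compare_topologies(documented, discovered)
print("Documented but not found:", missing)
print("Found but not documented:", undocumented)
```

Normalizing the endpoint order matters because a discovery tool may report a link from either end.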


Network configuration (1)

In ATLAS we have more than 200 switches

– Different vendors
– Different mechanisms for configuration and monitoring (telnet, SNMP, web)

Q: How to access all devices in a transparent manner?

– A: Bring them all under a common denominator (common interface)

Q: How to automate network management tasks?

– A: Write scripts (little programs)

sw_script = Set of Python modules which can be used as building blocks for network management solutions

Common programming interface to all devices (object-oriented)

“Intelligent” tools for configuration and monitoring can be developed

switches + scripting = sw_script
http://cern.ch/ciobota/projects/sw_script/
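The idea of a common, object-oriented interface over devices with different access mechanisms can be illustrated as follows. This is a sketch in current Python 3, not the sw_script code: the class names `Switch`, `SnmpSwitch` and `TelnetSwitch` are hypothetical, the backends are stubbed with fixed data, and only the method name `get_port_names` follows the naming used in this talk.

```python
from abc import ABC, abstractmethod


class Switch(ABC):
    """Vendor-independent interface that management tools program against."""

    def __init__(self, ip_address):
        self.ip_address = ip_address

    @abstractmethod
    def get_port_names(self):
        """Return the list of port names on the device."""


class SnmpSwitch(Switch):
    """Hypothetical device that would be queried over SNMP (stubbed here)."""

    def get_port_names(self):
        return ["1/1", "1/2"]  # a real backend would query the device


class TelnetSwitch(Switch):
    """Hypothetical device that would be driven over telnet (stubbed here)."""

    def get_port_names(self):
        return ["ge.1.1", "ge.1.2"]  # a real backend would parse CLI output


def count_ports(switches):
    """Works on any mix of devices because only the common interface is used."""
    return sum(len(sw.get_port_names()) for sw in switches)


fleet = [SnmpSwitch("192.0.2.10"), TelnetSwitch("192.0.2.11")]
print(count_ports(fleet))
```

Tools such as `count_ports` never need to know which access mechanism a device uses, which is what makes higher-level scripts reusable across vendors.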


Interactive session with sw_script

# Start the Python interpreter
$ python2.5

# Load the sw_script module
>>> import sw_script

# Create an object associated with the switch (a Cisco device in this case)
>>> sw = sw_script.Cisco_Catalyst_6500_Switch(ip_address="192.168.100.59")

# List the ports available on this device
>>> sw.get_port_names()
['1/1', '1/2', '1/3', '1/4', ....

# Get all the information available for an interface
>>> sw.get("1/4")
[('rx_packets', 519.0), ('rx_bytes', 127937.0), ('rx_discards', 0.0), ('rx_errors', 0.0),
 ('tx_packets', 11199.0), ('tx_bytes', 1111661.0), ('tx_discards', 0.0), ('tx_errors', 0.0),
 ('description', 'GigabitEthernet1/4'), ('link_state', 'up'), ('mac_addr', ['00:90:27:8F:94:E3'])]

# Set the description (ifAlias) of an interface
>>> sw.set_interface_alias("1/4", "Uplink to Core Router")

# Show the serial number of this device
>>> print sw.get_serial_number()
FOC0913U075

sw_script is responsible for more than half of our network management toolbox

Features
– Supports devices from different vendors

– Network topology auto-discovery

– Can do traffic monitoring in real-time

– Works as a module, can be easily embedded into other apps
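Real-time traffic monitoring is typically built on counters like the `rx_bytes` and `tx_bytes` values shown in the session above: two readings taken some seconds apart give an average rate. The talk does not describe how sw_script computes rates, so the function below is only a sketch of the usual arithmetic; the function name and the `counter_bits` parameter are assumptions made for this example.

```python
def throughput_bps(prev_bytes, curr_bytes, interval_s, counter_bits=64):
    """Average throughput in bit/s between two byte-counter readings.

    Handles a single wrap-around of an unsigned counter that is
    `counter_bits` wide (32 or 64 bits are the common cases).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_bytes - prev_bytes
    if delta < 0:  # the counter wrapped between the two readings
        delta += 2 ** counter_bits
    return delta * 8 / interval_s


# 1 250 000 bytes received in 10 seconds
print(throughput_bps(127937, 127937 + 1250000, 10))
```

The wrap-around branch matters on fast links: a 32-bit byte counter on a saturated Gigabit Ethernet port wraps in roughly half a minute.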


Network configuration (2)

In ATLAS, we have programs which use sw_script to perform configuration changes on devices:
– defining VLANs
– enabling protocols: spanning tree, time synchronization, etc.
– setting interface aliases (descriptions)

We use Python scripts to perform unattended firmware upgrades

For keeping track of configuration files we plan to use ZipTie (open-source software)
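A batch configuration change such as setting interface aliases could look roughly like the sketch below. It relies only on the `set_interface_alias(port, alias)` call shown in the interactive session; `RecordingSwitch` is a hypothetical stand-in used so the example runs without a real device, and `apply_aliases` is not part of sw_script.

```python
class RecordingSwitch:
    """Stand-in for a switch object; it only records the requested aliases."""

    def __init__(self):
        self.aliases = {}

    def set_interface_alias(self, port, alias):
        self.aliases[port] = alias


def apply_aliases(switch, aliases):
    """Set every alias in `aliases` ({port: description}) on `switch`
    and return the list of ports for which the call failed."""
    failed = []
    for port, alias in sorted(aliases.items()):
        try:
            switch.set_interface_alias(port, alias)
        except Exception:  # keep going so one bad port does not stop the batch
            failed.append(port)
    return failed


sw = RecordingSwitch()
failed = apply_aliases(sw, {"1/4": "Uplink to Core Router", "1/5": "Spare"})
print("failed ports:", failed)
```

Collecting failures instead of stopping at the first error suits unattended runs, where the operator reviews the result afterwards.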


Basic monitoring

Spectrum, a commercial package from Computer Associates, for device health and traffic monitoring (used by the CERN IT department)

Monitors devices, raises alarms in case of failures

Auto-discovery for almost all network connections

Historical info
– Gathers statistics from all devices
– Throughput and error rates saved every 30 seconds

Limitations
– The Spectrum GUI is hard to use
– It is not easy to integrate with 3rd party apps
– Limited support for network performance monitoring
– Basic support for querying historical traffic data
– No support for device configuration
– Virtually no features for diagnostics

[Screenshot: Spectrum GUI]

We developed software to fill in the gaps


Navigating in the realm of monitoring data

Spectrum produces 3 plots for each network interface. We will have 5000 ports, hence 15000 plots to look at…

We developed tools to browse, query and analyze the traffic plots.


Network browser


Searching and aggregating plots


Scanning for traffic events


Integration with ATLAS software

Network Panel
– Shows network monitoring information relevant to an ATLAS data acquisition run

Alarm Watcher
– Forwards alarms from Spectrum into the ATLAS “official” messaging channels

IS Feeder
– Publishes network statistics to the Information Services, a monitoring sub-system in ATLAS

[Screenshot: the Network Panel]


Network visualization – 2D approach

Application which shows a topological map of the network

Colors the connections in real time according to their state and usage

Overloaded links are easy to spot

Good navigation features (zoom, pan)

Based on GUESS, a Java application for visualizing graphs
– http://graphexploration.cond.org/

We developed a network monitoring plug-in for GUESS
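Coloring a connection by state and usage comes down to a small mapping from link status and utilization to a color. The sketch below is not the GUESS plug-in itself: the thresholds (50% and 90%), the color names and the function name are assumptions chosen for illustration, and the default capacity of 1 Gbit/s reflects the Gigabit Ethernet links mentioned earlier.

```python
def link_color(link_up, rate_bps, capacity_bps=1e9):
    """Map a link's state and load to a display color name."""
    if not link_up:
        return "gray"
    utilization = rate_bps / capacity_bps
    if utilization >= 0.9:  # assumed threshold for an overloaded link
        return "red"
    if utilization >= 0.5:  # assumed threshold for a busy link
        return "orange"
    return "green"


for state, rate in [(False, 0.0), (True, 1e8), (True, 6e8), (True, 9.5e8)]:
    print(state, rate, link_color(state, rate))
```

With such a mapping, an overloaded link stands out on the topological map without the operator reading any numbers.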


Network visualization – 3D approach (1)

Each object contains a panel with traffic information (updated in real-time)

Containers (racks, rooms) show aggregate values

Technologies used: X3D, Java and the Octaga Player

3D model of the network

Racks, switches and computers

Furniture in the 3D space

Navigation similar to Google Earth
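The aggregate values shown by containers such as racks can be obtained by summing the rates of the devices they contain. The following sketch assumes a simple mapping from device name to rack name; the names and the function are hypothetical and not taken from the 3D application.

```python
from collections import defaultdict


def aggregate_by_rack(device_rates, device_rack):
    """Sum per-device traffic rates (bit/s) into per-rack totals."""
    totals = defaultdict(float)
    for device, rate in device_rates.items():
        totals[device_rack[device]] += rate
    return dict(totals)


# Hypothetical rates and rack assignments.
rates = {"pc-001": 2e8, "pc-002": 3e8, "pc-101": 1e8}
racks = {"pc-001": "rack-01", "pc-002": "rack-01", "pc-101": "rack-02"}
print(aggregate_by_rack(rates, racks))
```

The same function applied to a rack-to-room mapping would give room-level totals, so the aggregation can be repeated at each level of the container hierarchy.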


Network visualization – 3D approach (2)


Real-time traffic monitoring

Connections for one switch (with traffic values)

The ATLAS applications running now in the network

Real-time global top (most active connections)
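A "global top" of the most active connections is a partial sort over the current rates. The sketch below uses Python's standard `heapq.nlargest`; the connection names and rates are made up, and the way the real tool collects and ranks its data is not described in the talk.

```python
import heapq


def top_connections(rates, n=10):
    """Return the n busiest connections as (name, rate) pairs, busiest first."""
    return heapq.nlargest(n, rates.items(), key=lambda item: item[1])


# Hypothetical current rates in bit/s, keyed by "device:port".
current = {
    "sw-core-1:1/4": 7.2e8,
    "sw-rack-07:49": 7.1e8,
    "sw-rack-08:12": 3.0e6,
    "sw-rack-09:03": 4.5e8,
}
for name, rate in top_connections(current, n=3):
    print(name, rate)
```

`heapq.nlargest` avoids sorting all connections when only a handful are displayed, which helps when the list is refreshed every few seconds.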


Diagnostics

For immediate response, we look in Spectrum and in the sw_script web pages

Human inspection of traffic plots (aggregates) – we search for abnormal patterns and correlations between plots

We have a collection of scripts to test different things
– Checking that machines are configured properly and connections are ok

For bandwidth-related issues we use iperf

All the network operations are documented in a knowledge base (wiki)


Plans for the future

Better visualization techniques for traffic plots

Analysis tools for monitoring data. Pattern detection and recognition (periodic events, monotonic variations, etc.)

Add support for sFlow, the standard for statistical sampling – very useful to diagnose network congestion

Design and implement an expert system which will help us troubleshoot network issues
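As one possible reading of the "monotonic variations" mentioned in the plans above, the sketch below finds the longest strictly increasing stretch in a series of samples, for example an error rate that keeps growing. It illustrates the kind of analysis meant, not a planned implementation; the function name and the sample values are invented.

```python
def longest_increasing_run(samples):
    """Return (start, end) indices of the longest strictly increasing run,
    or None for an empty series."""
    if not samples:
        return None
    best = (0, 0)
    start = 0
    for i in range(1, len(samples)):
        if samples[i] <= samples[i - 1]:
            start = i  # the run is broken; a new one starts here
        if i - start > best[1] - best[0]:
            best = (start, i)
    return best


series = [5, 1, 2, 3, 4, 2, 3]
print(longest_increasing_run(series))
```

A long run on a quantity that normally fluctuates, such as discards on a port, would be a candidate for closer inspection.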


The big picture

[Diagram labels:]

Historical traffic data

Real-time traffic info

Dynamic web pages: browse, search and aggregate

2D and 3D network visualization

ATLAS software – network status and alarms

Equipment configuration

Device health monitoring

Equipment auto-discovery, inventory and registration

Legend: commercial package (Spectrum); in-house development (sw_script & co.)