Network Incident Handling
WLCG Grid Deployment Board
November 2010
Alessandro Di Girolamo, John Shade, Jamie Shiers
CERN IT Department, Experiment Support (ES)
(Revised 2 December 2010)
Overview
• The following is our understanding of how inter-site network incidents should be handled in WLCG
• The procedure is simple, involves a well-defined set of actors, and is applicable to both LHCOPN and GPN (general purpose network) incidents
• But first, a problem statement
Network Incidents
1. Case 1: a “clean” cut of the link between sites A & B
– Traffic is automatically rerouted; the interruption is transparent
– Failover / fail-back is sometimes manual
– No major service disruption; this case is well understood
2. Case 2: a degradation – lower than expected transfer rates and/or high failure rates
– These are the cases that sometimes take a long time to diagnose, and on which we will focus
• The procedure is applicable to both, but case 1 is not (really) a problem
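To make the distinction concrete, here is a minimal sketch of how a monitoring check might separate the two cases from observed transfer metrics. The field names and thresholds are illustrative assumptions, not part of any agreed WLCG procedure.

```python
# Illustrative sketch only: field names and thresholds are assumptions,
# not part of the WLCG procedure described above.
from dataclasses import dataclass

@dataclass
class LinkMetrics:
    reachable: bool          # does traffic still flow between sites A and B?
    throughput_mbps: float   # observed transfer rate
    expected_mbps: float     # nominal rate for this link
    failure_rate: float      # fraction of failed transfers, 0.0 - 1.0

def classify_incident(m: LinkMetrics) -> str:
    """Case 1: clean cut (traffic rerouted, transparent).
    Case 2: degradation (low rates and/or high failure rates)."""
    if not m.reachable:
        return "case 1: clean cut - rerouting expected, usually transparent"
    degraded_rate = m.throughput_mbps < 0.5 * m.expected_mbps   # assumed threshold
    high_failures = m.failure_rate > 0.10                        # assumed threshold
    if degraded_rate or high_failures:
        return "case 2: degradation - open a ticket, may take long to diagnose"
    return "healthy"

print(classify_incident(LinkMetrics(True, 800.0, 10000.0, 0.25)))
```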
The LHCOPN
[Figure: map of T0-T1 and T1-T1 traffic on the LHCOPN, by edoardo.martelli@cern.ch, 2010-09-16. Thick lines: >=10 Gbps; thin lines: <10 Gbps; one link not deployed yet. Coloured squares mark the VOs served at each site (ALICE, ATLAS, CMS, LHCb); some links have internet backup available or carry T1-T1 traffic only. Link labels give per-link capacities and providers, e.g. 10G Renater, 10G Geant-RedIRIS-Anella, 2x8.5Gb and 3.5Gb USLHCnet-ESnet, 10G Nordunet-Geant, 10G JANET(UK)-Geant, 10G Surfnet-Geant, 10G DFN-Geant, 10G GARR-Geant, Canarie-Surfnet-Geant, ASnet, 10G Surfnet-DFN, 10G DFN-SWITCH-GARR, 10G DFN-Renater, 10G Nordunet-Surfnet, 2x1G local interconnection. p2p prefix: 192.16.166.0/24.]
Sites, AS numbers and announced prefixes:
CH-CERN       AS513    128.142.0.0/16
CA-TRIUMF     AS36391  206.12.1.0/24, 206.12.9.64/28
DE-KIT        AS34878  192.108.45.0/24, 192.108.46.0/23
ES-PIC        AS43115  193.109.172.0/24
FR-CCIN2P3    AS789    193.48.99.0/24
IT-INFN-CNAF  AS137    192.135.23.0/24, 131.154.128.0/17
NDGF          AS39590  109.105.124.0/22, 193.10.122.0/23, 193.10.124.0/24
NL-T1         AS1126   145.100.32.0/22, 145.100.17.0/28; AS1104 194.171.96.128/25
TW-ASGC       AS24167  117.103.96.0/20, 140.109.98.0/24, 140.109.102.0/24, 202.169.168.0/22
UK-T1-RAL     AS43475  130.246.178.0/23, 130.246.152.240/28
US-FNAL-CMS   AS3152   131.225.2.0/24, 131.225.160.0/24, 131.225.184.0/22, 131.225.188.0/22, 131.225.204.0/22
US-T1-BNL     AS43     130.199.185.0/24, 130.199.48.0/23, 130.199.54.0/24, 192.12.15.0/24
VOs’ “view” of the LHCOPN
[Figure: the same LHCOPN site map (sites, AS numbers and prefixes as above), annotated with what the VOs actually see: file transfers, i.e. throughput and failure rates.]
Network Degradation
[Figure: Site A and Site B, each in its own network domain (Domain A, Domain B), connected via an intermediate Domain C.]
• VO X observes high failure rates / low performance in transfers between sites A & B
• After basic debugging, the problem is declared a “network issue”
• The site responsibles at both sites A & B are informed (ticket)
• They are responsible for updating the ticket and for interactions with the network contacts at their respective sites
– Ticket ownership follows the FTS model – i.e. the destination site owns the ticket (see the sketch below)
• All additional complexity – e.g. Domain C and possibly others – is transparent to VO X
– NRENs, GEANT, USLHCNET, etc.
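A minimal sketch of the ownership rule, assuming a hypothetical ticket structure (the helper below is invented purely to illustrate who owns the ticket and who is informed; it is not a real ticketing-system API):

```python
# Hypothetical helper illustrating the ownership rule above; the ticket
# structure and field names are invented, not a real ticketing API.
def open_network_ticket(vo: str, source_site: str, dest_site: str) -> dict:
    """Open a 'network issue' ticket for degraded transfers between two sites."""
    return {
        "vo": vo,
        "route": (source_site, dest_site),
        # FTS model: the destination site owns the ticket...
        "owner": dest_site,
        # ...but site responsibles at BOTH endpoints are informed.
        "informed": [source_site, dest_site],
        # Intermediate domains (NRENs, GEANT, USLHCNET, ...) never appear
        # here: they stay transparent to the VO.
    }

ticket = open_network_ticket("ATLAS", "CH-CERN", "CA-TRIUMF")
print(ticket["owner"])     # CA-TRIUMF - the destination site owns the ticket
print(ticket["informed"])  # ['CH-CERN', 'CA-TRIUMF']
```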
“Network Guys”
• Are responsible for ensuring that the link between the sites is (or becomes) “clean”
• This includes interaction with all relevant partners “in the middle”
• Discussions between “network guys” should be transparent to VO & site representatives
• Site reps get information from their local network reps and ensure that the ticket is updated
Current Situation
• Assuming that this understanding is correct (which in itself would be a first step), we still see situations where:
1. Tickets are not regularly updated
2. Tickets are not updated on “change of state”
• E.g. the experiment observes much better transfers, but there is no record of what was done to achieve this
Q: is this understanding correct?
Q: can we improve on its implementation?
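Both failure modes are mechanically checkable. A sketch of such a check, assuming an invented ticket record and a 24-hour “regular update” threshold (the threshold is an assumption, not an agreed value):

```python
# Sketch of a check for the two failure modes above; the ticket record
# layout and the 24h threshold are assumptions for illustration.
from datetime import datetime, timedelta

MAX_SILENCE = timedelta(hours=24)   # assumed "regular update" interval

def ticket_problems(last_update: datetime, state_changed_at: datetime,
                    now: datetime) -> list:
    problems = []
    # 1. Tickets are not regularly updated
    if now - last_update > MAX_SILENCE:
        problems.append("stale: no update for more than 24h")
    # 2. Tickets are not updated on "change of state"
    #    (e.g. transfers improved but nothing was recorded in the ticket)
    if state_changed_at > last_update:
        problems.append("state changed after last update - no record of why")
    return problems

print(ticket_problems(last_update=datetime(2010, 11, 8, 9, 0),
                      state_changed_at=datetime(2010, 11, 9, 15, 0),
                      now=datetime(2010, 11, 10, 12, 0)))
```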
Summary
• A simple model for handling network problems was discussed at the last LHCOPN meeting
• It applies not only to LHCOPN links but also to non-OPN links – which is highly desirable
• If the model is agreed, we need to formalize and implement it
• IMHO, regular ticket updates are also an important component of the model
Network Debugging
• Network contacts at A & B are responsible for ensuring that their connection to the external network, and the end-to-end link between the sites, are “clean”
• Network contacts inform site contacts of changes of state (preferably), or site contacts poll
• Experiments request at least daily updates of the ticket, plus an update on every change of state – at least for those tickets marked high priority
• The problem is followed until it is solved
• Escalation, if required, is based on common WLCG timelines (see the sketch below)
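A sketch of what “escalation based on common WLCG timelines” could look like in practice. The priority levels and deadlines below are invented placeholders, since the actual agreed timelines are not specified here:

```python
# Sketch of priority-based escalation against agreed timelines; the
# deadlines below are invented placeholders, not the actual WLCG values.
from datetime import datetime, timedelta

ESCALATION_DEADLINES = {          # time allowed without progress, per priority
    "top":    timedelta(hours=4),
    "high":   timedelta(hours=24),
    "normal": timedelta(days=3),
}

def needs_escalation(priority: str, opened: datetime, last_progress: datetime,
                     now: datetime) -> bool:
    """Escalate if the ticket has seen no progress within the agreed window."""
    deadline = ESCALATION_DEADLINES[priority]
    return (now - max(opened, last_progress)) > deadline

print(needs_escalation("high",
                       opened=datetime(2010, 11, 8, 9, 0),
                       last_progress=datetime(2010, 11, 8, 9, 0),
                       now=datetime(2010, 11, 10, 12, 0)))  # True -> escalate
```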
Backup
Network Degradation
• VO X observes high failure rates / low performance in transfers between sites A & B
• After basic debugging (e.g. to check that it is not a simple problem with, say, the source or destination SE and/or the transfer software), the problem is declared a “network issue”
• The site responsibles at both sites A & B are informed (ticket)
• They now “own” the ticket and are responsible for updating it and for interactions with the network contacts at their respective sites
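The “basic debugging” step can be thought of as triangulation: test each endpoint against a third, known-good site before blaming the network. A sketch, with an invented probe standing in for a real test transfer (in reality this would be, e.g., a small FTS test transfer):

```python
# Sketch of the "basic debugging" triage described above: test each endpoint
# against a third, known-good reference site to rule out a local SE or
# transfer-software problem before declaring a network issue.
KNOWN_RESULTS = {  # illustrative probe outcomes, not real measurements
    ("REF", "A"): True, ("A", "REF"): True,
    ("REF", "B"): True, ("B", "REF"): True,
}

def transfer_ok(src: str, dst: str) -> bool:
    """Stand-in probe: in practice, run a small test transfer src -> dst."""
    return KNOWN_RESULTS.get((src, dst), False)

def triage(site_a: str, site_b: str, ref: str) -> str:
    for site in (site_a, site_b):
        if not (transfer_ok(ref, site) and transfer_ok(site, ref)):
            return f"{site}: suspect local SE / transfer software, not the network"
    # Both endpoints are clean against the reference site, so the A<->B
    # path itself is the likely culprit: declare a "network issue".
    return "declare network issue between A and B - open a ticket"

print(triage("A", "B", "REF"))  # -> declare network issue between A and B
```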