IRNC NOC
Luke Fowler
2016 Internet2 Technology Exchange, Miami
Topics
๏IRNC NOC: Overview
๏Performance Engagement Team (PET)
IRNC Program
๏ International Research Network Connections
๏ National Science Foundation program
๏ Funds network infrastructure and other supporting activities such as measurement and NOC for international science, research, and education.
๏ Indiana University GlobalNOC awarded to establish an IRNC NOC
https://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503382
IRNC NOC
๏ The International Research Network Connections Network Operations Center (IRNC NOC) serves as a cooperative point of contact and communications for IRNC network management, providing consolidated network monitoring, reporting, and operational visibility for the IRNC program. The IRNC NOC facilitates a single set of operational expectations for all IRNC-funded infrastructure programs; this enables greater availability of IRNC infrastructure and improves results in troubleshooting multi-domain network issues. A central data repository created by the IRNC NOC provides critical operational information, monitoring data, and performance metrics in support of NSF-funded science and research.
IRNC NOC
๏24x7x365 NOC support for IRNC infrastructure projects
๏ 1-855-IRNC-NOC
๏ [email protected] // [email protected]
๏ http://irncnoc.globalnoc.iu.edu/
IRNC NOC
๏ Service Desk function serves as entry point and single point of contact for issues reported by users or detected via proactive monitoring
๏ Creates, maintains, and shepherds trouble tickets for various types of events:
• Unscheduled outage
• Scheduled maintenance
• Problem report
• Service Request
IRNC NOC
๏The IRNC NOC service desk develops processes/procedures with each IRNC infrastructure project for:
• Event notification (outage/maintenance)
• Problem assignment and escalation
• Operational/availability reporting
• Etc.
Proactive Monitoring
๏ IRNC NOC is working with each IRNC infrastructure project to establish a monitoring plan tailored to the project
๏ Leveraging existing GlobalNOC tools, including:
• GlobalNOC Database
• GlobalNOC Alertmon / Auto-monitoring
• SNAPP
๏ Some open questions remain for projects that are more experimentally focused
Operational Reporting
๏ Provide reports on a regular basis for IRNC infrastructure, detailing:
• Unscheduled outages
• Scheduled maintenance
• Events of note
• Infrastructure availability
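As a rough illustration of how an availability figure in such a report could be derived, the sketch below computes availability for a reporting period from a list of unscheduled outage intervals. The function name, the choice to count only unscheduled outages, and the sample dates are all assumptions for illustration, not the IRNC NOC's documented method.

```python
from datetime import datetime, timedelta

def availability_percent(period_start, period_end, outages):
    """Percent of the period not covered by unscheduled outages.

    outages: list of (start, end) datetimes. Overlapping intervals are
    merged so the same downtime is not counted twice.
    """
    period = (period_end - period_start).total_seconds()
    if period <= 0:
        raise ValueError("period_end must be after period_start")
    # Clip each outage to the reporting period; drop outages outside it.
    clipped = sorted(
        (max(s, period_start), min(e, period_end))
        for s, e in outages
        if e > period_start and s < period_end
    )
    down = timedelta(0)
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Gap before this outage: close out the previous merged interval.
            if cur_end is not None:
                down += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            # Overlaps the current merged interval: extend it.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        down += cur_end - cur_start
    return 100.0 * (1 - down.total_seconds() / period)

# Hypothetical data: a 30-day month with two overlapping outages (4 h total).
start = datetime(2016, 9, 1)
end = datetime(2016, 10, 1)
outages = [
    (datetime(2016, 9, 10, 0, 0), datetime(2016, 9, 10, 2, 0)),
    (datetime(2016, 9, 10, 1, 0), datetime(2016, 9, 10, 4, 0)),
]
print(round(availability_percent(start, end, outages), 3))
```

Merging overlaps before summing matters when several tickets describe the same event; a report that also excludes or separately lists scheduled maintenance would need a second interval list.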
What We Don’t Do?
๏ IRNC NOC helps find and verify the problems, and track their status from start to finish.
๏ We don’t (usually) “actually fix” the problem
๏ Each IRNC infrastructure project does its own network engineering work
Data Archive
๏Collaborating with IRNC NetSage project (Dr. Jennifer Schopf, IU) to establish a shared data archive of network telemetry data.
๏ Used by NOC for problem detection, reporting, etc.
๏ Used by NetSage project for analysis & visualization
๏ Collects data using a variety of formats/protocols including SNMP, NetFlow, packet trace, etc.
๏ Working with IRNC participants on data privacy issues to provide appropriate data as a publicly available resource while ensuring sensitive data is only available for internal NOC use or summarized reporting.
Leveraging Route Views
๏ Beginning work to integrate Route Views data into IRNC NOC activity.
๏ Idea: detect and report on ‘interesting’ / ‘important’ routing changes related to IRNC infrastructure
๏ Use data from NetSage to identify ‘routes of interest’
๏ Use data from Route Views to detect/observe changes in these routes
๏ Build operational reports, and potentially eventually proactive alarming based on this data
๏ New staff member beginning to work on this project over the fall.
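The idea of watching ‘routes of interest’ for changes can be sketched as a comparison of two routing snapshots. This is only a minimal sketch: the snapshot format (a dict mapping prefix to AS path), the change categories, and the function name are hypothetical, and parsing actual Route Views data is out of scope. The prefixes and AS numbers are documentation-reserved values, not real IRNC routes.

```python
def routing_changes(before, after, prefixes_of_interest):
    """Compare two snapshots mapping prefix -> AS path (tuple of ASNs).

    Returns a list of (prefix, kind, old_path, new_path) for each prefix
    of interest whose path differs between the snapshots.
    """
    changes = []
    for prefix in sorted(prefixes_of_interest):
        old, new = before.get(prefix), after.get(prefix)
        if old == new:
            continue  # unchanged, or absent from both snapshots
        if old is None:
            kind = "announced"
        elif new is None:
            kind = "withdrawn"
        elif old[-1:] != new[-1:]:
            kind = "origin_changed"  # last ASN in the path is the origin
        else:
            kind = "path_changed"
        changes.append((prefix, kind, old, new))
    return changes

# Hypothetical snapshots using documentation prefixes and ASNs.
before = {
    "192.0.2.0/24": (64496, 64500, 64510),
    "198.51.100.0/24": (64496, 64511),
}
after = {
    "192.0.2.0/24": (64496, 64501, 64510),
}
interest = {"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"}

for change in routing_changes(before, after, interest):
    print(change)
```

An operational report could list these tuples per period; alarming would additionally need a policy for which kinds of change are ‘important’, which the slides leave open.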
IRNC NOC
Performance Engagement Team
IRNC NOC is supported by the National Science Foundation award 1450934
Background
๏ As network technology becomes more complex and opaque, troubleshooting performance issues becomes more difficult for the layperson.
๏ Trends
• Increased Layer2 infrastructure obscures network path
• Heightened security removes public data metrics
• Increased use of network firewalls at the campus level
• Automated data transfer requires 24x7x365 support
๏As infrastructure complexity increases, the researcher is left to determine how to solve performance issues
IRNC PET: Three Charges
1. Drive quick resolution of international inter-domain performance issues
2. Build a common performance troubleshooting playbook
3. Evolve perfSONAR as a tool for performance incident management
Drive resolution
๏ Centralized POC to request network troubleshooting assistance
๏ PET will
• Identify path
• Investigate with network contacts
• Test with available measurement points
• Resolve problems that are resolvable (and acknowledge problems that aren’t)
๏Researchers and network engineers can involve the PET
๏ Issues are tracked in a ticketing system -> creates accountability, metrics, and centralized contact tracking
Performance Playbook
๏ IRNC NOC PET will collaborate on, design, and maintain a centralized troubleshooting process with major partner networks
๏ PET will maintain a website with network troubleshooting resources and references
• https://irncnoc.globalnoc.iu.edu/
• Have worked 10+ performance issues to refine our internal process and understanding of where external collaboration is necessary
• Performance process on next slide
• Will be working with similar performance-focused efforts (eduPERT, ESnet, GÉANT, etc.) to help define standards for collaboration, shared troubleshooting, and knowledge capture
[Flowchart: IRNC PET performance troubleshooting process, steps numbered 1 to 20. Box labels recoverable from the slide: Issue Identified?; PET Issue Submitted; Assign PET Case Manager + Systems Engineer; Initial Questions and issue validation; Determine Network or Systems primary actor; Retrieve or Draw Relevant Maps/Diagrams; Investigate with Publicly Available Tools; Open tickets w/ relevant networks; Seek Updates every 3 days; Weekly Customer updates; Resolvable?; Take Resolution Action; Update Diagram; Notify Customer; Close Successful; Set state to inactive; Set date for review; Monthly Customer updates; Management Review; Write After Action Report; Close Unsuccessful. Branch labels: yes / no / not yet; future fix identified; date passed / date not passed; A month has passed; Can’t Reproduce; taking too long; Continue; Halt; discuss; Additional Information.]
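The process chart names three follow-up cadences: seeking updates from networks every 3 days, weekly customer updates, and monthly customer updates. A minimal sketch of how a ticketing helper might track which follow-ups are due is shown below. The `PetCase` fields, the function name, the assumption that monthly updates apply to inactive cases, and the 30-day approximation of "monthly" are all hypothetical and not taken from the IRNC NOC's tooling.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Intervals named on the process chart; everything else here is assumed.
NETWORK_UPDATE_EVERY = timedelta(days=3)
CUSTOMER_UPDATE_ACTIVE = timedelta(days=7)
CUSTOMER_UPDATE_INACTIVE = timedelta(days=30)  # "monthly", approximated

@dataclass
class PetCase:
    case_id: str                 # hypothetical identifier format
    active: bool                 # False once the case is set to inactive
    last_network_update: date
    last_customer_update: date

def actions_due(case, today):
    """Return the follow-up actions that are due for a case on a given day."""
    due = []
    # Only active cases are chased with the relevant networks.
    if case.active and today - case.last_network_update >= NETWORK_UPDATE_EVERY:
        due.append("seek update from networks")
    interval = CUSTOMER_UPDATE_ACTIVE if case.active else CUSTOMER_UPDATE_INACTIVE
    if today - case.last_customer_update >= interval:
        due.append("send customer update")
    return due

# Hypothetical cases.
active_case = PetCase("PET-001", True, date(2016, 9, 20), date(2016, 9, 19))
inactive_case = PetCase("PET-002", False, date(2016, 8, 1), date(2016, 8, 20))
today = date(2016, 9, 23)
print(actions_due(active_case, today))
print(actions_due(inactive_case, today))
```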
IRNC PET Performance Troubleshooting Principles
• Investigate as much as you can using publicly available monitoring systems and data
• Provide centralized store of troubleshooting information (maps, ticket documentation, findings, etc.)
• More frequent updates to interested parties
• Likely lots of external collaboration required
Evolve perfSONAR
๏ perfSONAR is used as the measurement tool of choice for the IRNC NOC.
๏ The more perfSONAR-enabled test points, the more successful the IRNC PET will be in assisting researchers without involving the individual network owners
๏ IRNC PET will use experience gained in working cases to provide feedback and enhancements to the perfSONAR project
Year 1 Findings
๏ Early involvement in the performance troubleshooting process: we’re more effective the earlier we’re brought in
• This largely comes down to awareness of the IRNC NOC PET and its charge
๏ perfSONAR deployments into the campus
• Issues tend to be local, and the closer the monitoring deployments are to the user, the more troubleshooting work the IRNC NOC PET can do without involving regional and campus resources
• Visibility into network topology, traffic monitors and other data is sometimes restricted for security reasons
๏ Cooperation from peer and campus network engineers, who may not see external user performance issues as a priority over their daily workload
• We attempt to get around this by being squeaky wheels on behalf of the researchers, but still…
Year 1 Findings (cont.)
๏ Identifying “invisible” infrastructure (Layer2 switches and firewalls)
๏ Collaboration within the community is hugely important
• Documentation of findings will support that
• We need a shared database of performance-focused contacts for large (and small) networks
๏ Understanding what network performance should be
• When performance has been bad for a long time, it’s difficult to know what the researcher should be getting
• Researchers sometimes lack the vocabulary or understanding to explain what they expect (e.g. “It just feels wrong”, “The graph looks off”)
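One way to put a rough number on what a researcher "should be getting" is a back-of-the-envelope estimate. The sketch below uses two widely known approximations, neither of which the slides say the PET uses: the Mathis et al. model for loss-limited single-stream TCP throughput, MSS/RTT × sqrt(3/2)/sqrt(p), and the bandwidth-delay product for the window needed to fill a path. The function names and the sample path values are assumptions for illustration, and the model describes classic loss-based TCP, not every modern congestion control algorithm.

```python
import math

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Approximate upper bound on single-stream TCP throughput in bits/s,
    using the Mathis et al. model: (MSS / RTT) * sqrt(3/2) / sqrt(p)."""
    if rtt_s <= 0 or not (0 < loss_rate <= 1):
        raise ValueError("rtt_s must be > 0 and loss_rate in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(loss_rate)

def bdp_bytes(link_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return link_bps * rtt_s / 8

# Hypothetical international path: 150 ms RTT, 1460-byte MSS, 0.01% loss.
estimate = mathis_throughput_bps(1460, 0.150, 1e-4)
print(f"loss-limited estimate: {estimate / 1e6:.1f} Mbit/s")

# Window needed to fill a 10 Gbit/s link at the same RTT.
print(f"BDP: {bdp_bytes(10e9, 0.150) / 1e6:.1f} MB")
```

Even a small loss rate on a long path caps a single stream far below link speed, which is one reason a long-standing "bad" baseline can go unnoticed.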
Next Steps
๏ Create performance-focused contact database
• Question: should we publish that? How open?
๏Outreach to science communities and R&E networks to make them aware the IRNC NOC PET exists as a resource
๏ Continue to gather more experience
• Assisting NSF-funded NetSage project in isolating problems on their perfSONAR mesh
• May do more generalized perfSONAR mesh monitoring beyond those in the IRNC project
Questions?
Chris Robb – [email protected]
Luke Fowler – [email protected]
IRNC NOC: [email protected]
Report a Performance Issue: [email protected]
IRNC NOC is supported by the National Science Foundation award 1450934