Diagnosing Internet Outages

20
Diagnosing Internet Outages Young Xu, Product Marketing Analyst

Transcript of Diagnosing Internet Outages

Diagnosing Internet Outages Young Xu, Product Marketing Analyst

2

About ThousandEyes ThousandEyes delivers visibility into every network your organization relies on.

Founded by network experts; strong

investor backing

Relied on for "critical operations by leading enterprises

Recognized as "an innovative "

new approach

27 Fortune 500

5 top 5 SaaS Companies 4 top 6 US Banks

3

I see an outage. Is this affecting! just this one test, just me, or

everybody?!

4

•  Detect outages in ISPs and understand their impact both globally and as it relates to your organization

Overview: Internet Outage Detection

•  See the global and account scope, as well as likely root cause of BGP reachability outages

Traffic Outage Detection Routing Outage Detection

5

1.  Anonymized traffic data is aggregated from all tests across the entire user base 2.  Algorithms then look for patterns in path traces terminating in the same ISP

How Traffic Outage Detection Works

New York Cloud Agent

Boston Enterprise Agent

Los Angeles Cloud Agent

Level 3 in San Jose

Cogent in Denver

Salesforce

Google

NY Times

Customer 2

Customer 1

6

Traffic Outage Detection Account scope

Global scope

Severity and scope of the issue at this interface

7

Routing Outage Detection Aggregates reachability issues in routing data from 300+ public monitors

Global scope

Account scope

Root cause analysis

8

•  April 23: Hurricane Electric route leak affecting AWS •  May 3: Trans-Atlantic issues in Level 3

–  https://blog.thousandeyes.com/trans-atlantic-issues-level-3-network/ •  May 20: Tata and TISparkle issues with submarine cable

–  https://blog.thousandeyes.com/smw-4-cable-fault-ripple-effects-across-networks/ •  June 6: Hurricane Electric removed >500 prefixes •  June 24: Tata cable cut in Singapore affecting Dropbox •  July 10: Level 3, NTT routing issues affecting JIRA

–  https://blog.thousandeyes.com/identifying-root-cause-routing-outage-detection/ •  July 17: Widespread issues in Telia’s network in Ashburn

–  https://blog.thousandeyes.com/analyzing-internet-issues-traffic-outage-detection/

Recent Major Outages Detected

9

•  Look for purple indicators and the ‘Outage Detected’ dropdown when investigating issues—these indicate detected outages!

•  Use quick links or select specific nodes/ASes to see how paths have changed over time

•  Correlate data from the web, network and routing layers to analyze root cause

•  See our blogs and Knowledge Base articles for more info: –  Blog on Traffic Outage Detection

–  https://blog.thousandeyes.com/analyzing-internet-issues-traffic-outage-detection/ –  Blog on Routing Outage Detection

–  https://blog.thousandeyes.com/identifying-root-cause-routing-outage-detection/ –  Knowledge Base:

–  https://support.thousandeyes.com/entries/110214366

Tips for Diagnosing Internet Outages

10

Demo

11

1. Network Layer Issues in Telia in Ashburn

Detected outage coincides with packet loss spikes

Ashburn, VA is “ground zero” for this outage

12

Specific Failure Points in Telia

High severity and wide scope (Outages affecting at least 20 tests for a NA/EU interface are likely to be wide in scope)

Terminal nodes in Telia

13

2. Hurricane Electric Route Flap Affects Telx

Detected outage coincides with spike in AS path changes

Root cause analysis points to Hurricane Electric and Telx

14

Route Flap by Hurricane Electric

Hurricane Electric

Routes flap from using HE to NTT, then back to HE

15

Causing Traffic Issues in Hurricane Electric

Hurricane Electric

16

3. NTT and Level 3 Routing Issues Affect JIRA

JIRA saw 0% availability and 100% packet loss

Most affected interfaces are in Ashburn, VA

17

Traffic Terminating in NTT

Traffic paths originally traversed Level 3 and NTT

Traffic paths then change to traverse only NTT, terminating there

18

JIRA’s /24 Prefix Becomes Unreachable

As the primary upstream ISP, Level 3 is associated with the most affected routes Routes through

upstream ISPs NTT and Level 3 all withdrawn

19

Routers Begin Using Misconfigured /16 Prefix

The backup /16 prefix directs to NTT, not JIRA’s network. This is why the traffic path changed to traverse only NTT, terminating there when JIRA’s IP couldn’t be found in NTT’s network.

See what you’re missing.

Watch the webinar:

https://www.thousandeyes.com/resources/diagnosing-internet-outages-webinar