INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ±...

34
INFRASTRUCTURE

Transcript of INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ±...

Page 1: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

INFRASTRUCTURE

Page 2: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

Production Engineer, PE Network Monitoring

Fault Detection at Scale

Giacomo Bagnoli

INFRASTRUCTURE

Page 3: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

• Monitoring the network

• How and what

• NetNORAD

• NFI

• TTLd

• Netsonar

• Lessons learned

• Future

Agenda

Page 4: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

Network Monitoring

Page 5: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

Fabric Networks

• Multi stage CLOS

topologies

• Lots of devices and

links

• BGP only

• IPv6 >> IPv4

• Large ECMP fan out

Page 6: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

Prineville, OR

Los Lunas, NM

Papillion, NE

Fort Worth, TX

Forest City, NC

Altoona, IA Clonee, Ireland

Luleå, Sweden

Odense, Denmark

Page 7: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

Why is passive not enough?

Active Network Monitoring

• SNMP: trusting the network devices

• Host TCP retransmits: packet loss is everywhere

• Active network monitoring:

• Inject synthetic packets in the network

• Treat the devices as black boxes, see if they can forward traffic

• Detect which service is impacted, triangulate loss to

device/interface

Page 8: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators
Page 9: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

NetNORAD, NFI, NetSONAR, TTLd

Page 10: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

NetNORAD Rapid detection of faults

TTLd End-to-end retransmit and loss detection using production traffic

What!?

NFI Isolate fault to a specific device or link

Netsonar Up/Down reachability info

Page 11: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

• A set of agents injects synthetic UDP

traffic

• Targeting all machines in the fleet

• targets >> agents

• Responder is deployed on all machines

• Collect packet loss and RTT

• Report and analyze

NetNORAD

Network

Page 12: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

NetNORAD - Target Selection

FRC

FRC1

FRC2

P01 P02

P02 P01

CLN

CLN1

CLN2

P01 P02

P02 P01

DC

REGION

GLOBAL

Page 13: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

NetNORAD – Data Pipeline

Alarming Timeseries SCUBA

• Agents reports to a fleet of

aggregators

• pre-aggregated results data per target

• Aggregators

• calculate per-pod percentiles for

loss/RTT

• augment data with locality info

• Reporting

• to SCUBA

• timeseries data

• alarming

Locality

Page 14: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

Observability

Using SCUBA

• By scope – isolates issue to

• Backbone, region, dc • By cluster/pod – isolates issue

to

• A small number of FSWs • By EBB/CBB

• Is replication traffic affected?

• By Tupperware Job

• Is my service affected?

Page 15: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

• Gray Network failures

• Detect and triangulate to device/link

• Auto remediation

• Also useful dashboards (timeseries and SCUBA)

Network Fault Isolation

NFI

Page 16: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

• Probe all paths by rotating the SRC

port (ECMP)

• Run traceroutes (also on reverse

paths)

• Associate loss with path info

• Scheduling and data processing

similar to NetNORAD

• Thrift – based (TCP)

How (shortest version possible)

NFI

Page 17: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

• Blackbox monitoring tool

• Sends ICMP probes to network switches

• Provides reachability information: is it up or down?

• Scheduling and data pipeline similar to NetNORAD and NFI

NetSONAR

Page 18: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

• Main goal: surface end to end retransmits throughout the

network

• Use production packets as probes

• A mixed approach (not passive, not active)

• End host mark one bit in the IP header when the packet is a

retransmission

• Uses MSB of TTL/Hop Limit

• Marking is done by an eBPF program on end hosts

• A collection framework collect stats from devices (sampled

data)

Mixed Passive / Active approach

TTLd

Page 19: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

Visualization • High density dashboards

• Using cubism.js

• Fancy javascript UIs

• (various iterations)

• Other experimental views

Page 20: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

Examples

Page 21: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

Example Fabric Layout

RSW

FSW

SSW

Page 22: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

• Clear signal in NetNORAD

• Triangulated successfully by

NFI

• Also seen in passive

collections

Bad FSW causing 25% loss in POD

Example 1

Page 23: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

• Low loss seen in NetNORAD

• NFI drained the device

• But the drain failed

• NFI alarms again

• Clear signal in TTLd too

Bad FSW could not be drained (failed automation)

Example 2

Page 24: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

• Congestion happens

• NFI uses outlier detection

• Not perfect

• Loss in NetNORAD was

just limited to a single

DSCP

Congestion, false alarm

Example 3

Page 25: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

Lessons Learned

Page 26: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

• Having multiple tools helps

• Separate failure domains

• Separation of concerns

• But also adds a lot of overhead

• Reliability:

• Regressions are usually the biggest problem

• Holes in coverage are the next big problem

• Dependency / cascading failures

Multiple tools, similar problems

Lesson learned

Page 27: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

• After validating the proof of concept

• How to make sure it continues working?

• … that it can detect failures

• … maintaining coverage

• … and keeping up with scale

• … and doesn’t fail with its dependencies?

i.e. how to know we can catch events reliably.

How to avoid regressions or holes?

Page 28: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

• It’s a function of time and space

• e.g. we cover the 90% of the devices 99% of the time

• Should not regress

• New devices should be covered once provisioned

• Monitor and alarm!

Coverage

Page 29: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

• Find a way to categorize events

• Possibly automatically

• Measure and keep history

• Make sure there’s no regressions

false positive vs true positive

Accuracy

Page 30: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

• How do we know we can detect events?

• Before we get an event, possibly!

• End-to-End (E2E) testing:

• Introduce fake faults and see if the tool can detect them

• Usually done via ACL injection to block traffic

• Middle-to-End (M2E) testing:

• Introduce fake data in the aggregation pipeline

• Useful for more complex failures

Not just for performances

Regression detection

Page 31: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

• Time to detection

• Time to alarm

• But also more classic metrics (cpu, mem, errors)

• Measure and alarm!

Performance

Page 32: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

• Degrade gracefully during large scale events

• i.e. what if SCUBA is down?

• or the timeseries database?

• “doomsday” tooling: • Review dependencies, see if you can drop as many as we can

• Provide a subset of functionalities

• Make sure it’s user friendly (both the UI and the help) • Make sure it’s continuously tested

Or how to survive when things start to fry

Dependencies failures

Page 33: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

Future

Page 34: INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ± Data Pipeline Alarming Timeseries SCUBA Agents reports to a fleet of aggregators

• Keep up with scale

• Support new devices and

networks

• Continue to provide a stable signal

• Exploring ML for data analysis

• Improve coverage

Lots of work to do

Future work