INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ±...

INFRASTRUCTURE

Production Engineer, PE Network Monitoring

Fault Detection at Scale

Giacomo Bagnoli

INFRASTRUCTURE

• Monitoring the network

• How and what

• NetNORAD

• NFI

• TTLd

• Netsonar

• Lessons learned

• Future

Agenda

Network Monitoring

Fabric Networks

• Multi stage CLOS

topologies

• Lots of devices and

• BGP only

• IPv6 >> IPv4

• Large ECMP fan out

Prineville, OR

Los Lunas, NM

Papillion, NE

Fort Worth, TX

Forest City, NC

Altoona, IA Clonee, Ireland

Luleå, Sweden

Odense, Denmark

Why is passive not enough?

Active Network Monitoring

• SNMP: trusting the network devices

• Host TCP retransmits: packet loss is everywhere

• Active network monitoring:

• Inject synthetic packets in the network

• Treat the devices as black boxes, see if they can forward traffic

• Detect which service is impacted, triangulate loss to

device/interface

NetNORAD, NFI, NetSONAR, TTLd

NetNORAD Rapid detection of faults

TTLd End-to-end retransmit and loss detection using production traffic

What!?

NFI Isolate fault to a specific device or link

Netsonar Up/Down reachability info

• A set of agents injects synthetic UDP

traffic

• Targeting all machines in the fleet

• targets >> agents

• Responder is deployed on all machines

• Collect packet loss and RTT

• Report and analyze

NetNORAD

Network

NetNORAD - Target Selection

P01 P02

P02 P01

P01 P02

P02 P01

REGION

GLOBAL

NetNORAD – Data Pipeline

Alarming Timeseries SCUBA

• Agents reports to a fleet of

aggregators

• pre-aggregated results data per target

• Aggregators

• calculate per-pod percentiles for

loss/RTT

• augment data with locality info

• Reporting

• to SCUBA

• timeseries data

• alarming

Locality

Observability

Using SCUBA

• By scope – isolates issue to

• Backbone, region, dc • By cluster/pod – isolates issue

• A small number of FSWs • By EBB/CBB

• Is replication traffic affected?

• By Tupperware Job

• Is my service affected?

• Gray Network failures

• Detect and triangulate to device/link

• Auto remediation

• Also useful dashboards (timeseries and SCUBA)

Network Fault Isolation

• Probe all paths by rotating the SRC

port (ECMP)

• Run traceroutes (also on reverse

paths)

• Associate loss with path info

• Scheduling and data processing

similar to NetNORAD

• Thrift – based (TCP)

How (shortest version possible)

• Blackbox monitoring tool

• Sends ICMP probes to network switches

• Provides reachability information: is it up or down?

• Scheduling and data pipeline similar to NetNORAD and NFI

NetSONAR

• Main goal: surface end to end retransmits throughout the

network

• Use production packets as probes

• A mixed approach (not passive, not active)

• End host mark one bit in the IP header when the packet is a

retransmission

• Uses MSB of TTL/Hop Limit

• Marking is done by an eBPF program on end hosts

• A collection framework collect stats from devices (sampled

Mixed Passive / Active approach

Visualization • High density dashboards

• Using cubism.js

• Fancy javascript UIs

• (various iterations)

• Other experimental views

Examples

Example Fabric Layout

• Clear signal in NetNORAD

• Triangulated successfully by

• Also seen in passive

collections

Bad FSW causing 25% loss in POD

Example 1

• Low loss seen in NetNORAD

• NFI drained the device

• But the drain failed

• NFI alarms again

• Clear signal in TTLd too

Bad FSW could not be drained (failed automation)

Example 2

• Congestion happens

• NFI uses outlier detection

• Not perfect

• Loss in NetNORAD was

just limited to a single

Congestion, false alarm

Example 3

Lessons Learned

• Having multiple tools helps

• Separate failure domains

• Separation of concerns

• But also adds a lot of overhead

• Reliability:

• Regressions are usually the biggest problem

• Holes in coverage are the next big problem

• Dependency / cascading failures

Multiple tools, similar problems

Lesson learned

• After validating the proof of concept

• How to make sure it continues working?

• … that it can detect failures

• … maintaining coverage

• … and keeping up with scale

• … and doesn’t fail with its dependencies?

i.e. how to know we can catch events reliably.

How to avoid regressions or holes?

• It’s a function of time and space

• e.g. we cover the 90% of the devices 99% of the time

• Should not regress

• New devices should be covered once provisioned

• Monitor and alarm!

Coverage

• Find a way to categorize events

• Possibly automatically

• Measure and keep history

• Make sure there’s no regressions

false positive vs true positive

Accuracy

• How do we know we can detect events?

• Before we get an event, possibly!

• End-to-End (E2E) testing:

• Introduce fake faults and see if the tool can detect them

• Usually done via ACL injection to block traffic

• Middle-to-End (M2E) testing:

• Introduce fake data in the aggregation pipeline

• Useful for more complex failures

Not just for performances

Regression detection

• Time to detection

• Time to alarm

• But also more classic metrics (cpu, mem, errors)

• Measure and alarm!

Performance

• Degrade gracefully during large scale events

• i.e. what if SCUBA is down?

• or the timeseries database?

• “doomsday” tooling: • Review dependencies, see if you can drop as many as we can

• Provide a subset of functionalities

• Make sure it’s user friendly (both the UI and the help) • Make sure it’s continuously tested

Or how to survive when things start to fry

Dependencies failures

Future

• Keep up with scale

• Support new devices and

networks

• Continue to provide a stable signal

• Exploring ML for data analysis

• Improve coverage

Lots of work to do

Future work

INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ±...

Documents

Transcript of INFRASTRUCTURE · FRC2 P01 P02 P01 P02 CLN CLN1 CLN2 P01 P02 P01 P02 DC REGION GLOBAL . NetNORAD ±...

BDSRA 2015 CLN1 Wishart

Persistent rigid-body motions and study’s “Ribaucour” problem · Vol. 108 (2017) Persistent rigid-body motions 151 Sd = 0 −P03,P02 P23 +pP01 P03 0 −P01 P31 +pP02 −P02

&R /WG - taipo-tech.com · / About Us / Our Advantages / Solar Products / Customer Cases / FAQ About Us P01 P02 P04 P14 P16 CONTENTS Shenzhen Taipo Energy Technology Co.,Ltd is a

P01 / P02 / P030104.nccdn.net/1_5/1ab/1e9/0f7/P01-P02-P03.pdf · 2013-11-27 · Customer name/logo Blind embossed Embossed + foilprinted 1 colour Foilprinted 1 colour Colours Housing

MINI CRANE CRAWLER · Enterprise Brief Introduction P01 P02 Henan SPT Machinery Equipment Co.,Ltd was Founded In 2015. SPT s consistent research and develop program ensures the manufacture

16-bit Single Chip Microcontroller - Epson · 16-bit Single Chip Microcontroller 16KB/32KB Flash ROM with read/program protection function ... assignment P01/EXCL01/UPMUX P02/BZOUT/UPMUX

P01 examinations P02 university fairs P03 olympiads and ...lis.sc.ke/wp-content/uploads/2016/04/LIS_Newsletter_April-2016_144.pdfstudents and teachers take part in various academic

S8000-TECHNICAL DATA(P01) P01 (1) - Commdoor Aluminum

High profile apartment development opportunity...New Apartments Proposed Visual 18.08.2016 4609-A-5012 Ful-App SS P01 18.08.2016 First issue SS P02 23.08.2016 Elevations revised SS

2017 BDSRA Cooper, Nelvagal, Egeland, Najafi, Assis and Repschlager CLN1, CLN2, CLN3

p01 · Title: p01 Created Date: 20151126092143Z

LINEAR MOTORS P01-23x80 · 2019. 1. 4. · 3 28 LINEAR MOTORS P01/P02 EDITION 24 SUBECT TO ALTERATIONS / CONTENTS / LINEAR MOTORS P01-23x80 Technical Data 29 Motor Specifications

Hitachi AC Servo Drive - esco.be · Hitachi AC Servo Drive AD Series Setup Software AHF-P01/P02 Instruction Manual Thank you for purchasing Hitachi AC Servo Drive Setup Software.

SYLLABUS · 1 SYLLABUS CUIN 3003 Educational Foundations Spring 2018 Instructor: Ms. Johnie C. Walker Section # and CRN: P01 23644 P02 20014 Office Location: Wilhelmina F. Delco Building,

SECTIONAL DIRECTIONAL HDS 11 VALVE SERIES · SECTIONAL DIRECTIONAL VALVE HDS 11 SERIES Outlet manifold (std) with T and open center P01 P02 P04 P05 Outlet manifold with T and H.P.C.O.

RFP - Outfitting Works Furniture and Equipment for the Office of … · 2019-06-29 · - ARQ 1.13: Detail- Kitchen2 Block 3 - ARQ 1.14: Detail- P01/P02 - RIS 2.01 General Plan Block

p02 · Title: p02 Created Date: 1/15/2020 10:49:11 AM

2016 BDSRA Mole CLN1, CLN2, CLN3, CLN5, CLN6, CLN7,

corrosion p02

CMOS single -chip 8-bit · PDF file- P02 (MC97F6108A DSCL pin) - P01 (MC97F6108A DSDA pin) Figure 1.1 On Chip Debugger 2 and Pin description (OCD2 mode) Connection 2: - P11