An Annotation Layer for Network Management
George Porter, Arne Baste,
David Chu, Dilip Joseph
Randy H. Katz
NetRads Retreat - June 2005
Goal of today’s talk
- Snapshot of our thinking in this area
- Several open research problems as to appropriateness of piggybacking, effectiveness of distributed observation, etc.
- Your feedback appreciated
Outline
- Motivating example: Discovering and protecting network service performance during stress
- PNEs as A-Layer building block
- Overview: Annotation layer as provider of component building block for network management
- Revisit network service example with A-Layer
- Research challenges, open issues, opportunities
Outline
- Motivating example: Discovering and protecting network service performance
- PNEs as A-Layer building block
- Overview: Annotation layer as provider of component building block for network management
- Revisit network service example with A-Layer
- Research challenges, open issues, opportunities
Motivating Example: Network service slowdown/failure
- Problem: Users in the access tier complain of slow web access, can’t mount files, and “DNS operation timed out” messages
- This problem started today at 10am
- Where to begin? Network connectivity between users and the outside seems OK, but name resolution is intermittent and slow
- We need tools to figure out who is affected, who isn’t, the cause, and a solution
[Diagram: clients in the access tier reach DNS, Web, NFS, and FTP servers in the server tier through the distribution tier]
Motivating Example: Network service slowdown/failure
- Network connectivity to DNS? [ping, traceroute]
- Are DNS requests making it to the server tier?
- What is happening to the request completion rate (is it lower)? Vs. network path losses (i.e., is it the path or the service?)
- Is the DNS server’s CPU level up?
- Localize the problem: Only this user, or other clients? Just that server?
- What is happening to the DNS req/reply completion rate of other servers in that cluster? Correlations? Is this user anomalous?
- So far: DNS overloaded, leading to timeouts on the client end
Why is the service overloaded?
- Is there an unusual number of requests from other sources? [deviation from the mean]
- What is the status of requests to this service network-wide? How has it changed since before the first reports of the problem?
- We discover that the number of DNS requests from access and ISP networks is unchanged (must be in the server tier)
- Other correlations? Yes, to SMTP traffic at the ISP ingress
- We suspect the endpoint of the SMTP traffic, a spam appliance, as the cause of the DNS performance loss
- No unusual surges of DNS from access or ISP (from outside our enterprise network); thus originating inside the server tier, and correlated to SMTP traffic
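The correlation step in the DNS-vs-SMTP example above can be sketched as a simple Pearson test on per-interval samples gathered at an iBox. The time series, threshold, and printed hint below are illustrative, not measured data.

```python
# Sketch: correlating a degraded service's latency with another traffic
# class, as in the DNS-vs-SMTP example. Numbers are hypothetical.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Per-interval samples observed at an iBox (toy data):
dns_latency_ms = [20, 22, 21, 80, 95, 90, 23, 85]
smtp_mbps      = [ 5,  6,  5, 40, 48, 45,  6, 42]

r = pearson(dns_latency_ms, smtp_mbps)
if r > 0.8:  # strong correlation: a hint, not yet causation
    print(f"DNS latency tracks SMTP volume (r={r:.2f}); try intervention")
```

A high coefficient is only a hint; the experimental-intervention step later in the talk is what turns the correlation into a causal claim.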
- Eliminate false positives: test this conjecture via experimental intervention
  - Temporarily bandwidth-throttle SMTP traffic from the ISP ingress
  - Test DNS latency from the access network
  - Find that DNS latency goes down when SMTP volume goes down
- We enact a new (but temporary) policy:
  - Redirect requests from the access tier to a secondary or tertiary DNS server (service separation for different users)
  - BW-regulate SMTP traffic to keep DNS server CPU load from peaking
  - Access users’ service restored; their traffic is protected
- Problem localized and mitigated
- Long-term solution: software upgrade, firmware upgrade, add a dedicated DNS cache for the appliance
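The throttle-then-observe test above can be sketched as a small control loop. The `measure`/`throttle`/`restore` callables are hypothetical hooks an iBox would expose; here they are injected so the sketch is self-contained.

```python
# Sketch of experimental intervention: throttle the suspect traffic
# class, re-measure the victim service, then restore. The hooks and the
# improvement threshold are illustrative assumptions.

def intervene(measure_victim, throttle_suspect, restore_suspect,
              improvement=0.5):
    """Return True if throttling the suspect clearly helps the victim."""
    before = measure_victim()      # e.g. DNS latency, SMTP unthrottled
    throttle_suspect()
    try:
        after = measure_victim()   # e.g. DNS latency, SMTP throttled
    finally:
        restore_suspect()          # the intervention is temporary
    return after <= before * improvement

# Toy usage with canned measurements (90 ms before, 25 ms during):
readings = iter([90.0, 25.0])
causal = intervene(lambda: next(readings), lambda: None, lambda: None)
print("suspect implicated:", causal)
```

Restoring the throttle in a `finally` block mirrors the slide’s point that the policy is temporary: the experiment must not itself become a lasting change.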
Example Review
Localizing and identifying the problem required:
- Network-wide visibility despite stressed links/servers
- Path information (network connectivity, protocol request/reply completion information)
- Finding changes in behavior (avg # requests/unit time, rate of change of traffic)
- Finding correlations between traffic (traffic classes, volume, network-level paths)
- Experimental intervention (correlation to causation)
- Enabling new policy (redirecting traffic to a secondary server, BW throttling/fencing misbehaving flows)
Principles for network management
- Network-wide visibility despite surges/overload/high loss rates
  - Low overhead
  - Path statistics gathering
  - Some protocol visibility (TCP, IP, services like DNS, NFS)
- Need to discover:
  - Changes to request/reply rate, completions, latency over time
  - Correlations between different flows, protocols, parts of the network
- New policies (actions)
  - For experimental intervention (root-cause discovery)
  - To protect good traffic
  - BW shaping, blocking, scheduling, fencing, selective drop
- Security
  - Against non-operators using this infrastructure
  - Against DoS attacks
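Of the action primitives above, BW shaping is classically built as a token bucket. This is a minimal sketch with an explicit clock so it is deterministic; the rate and burst size are illustrative, not values from the talk.

```python
# Minimal token-bucket shaper (standard technique, not the talk's
# specific mechanism). Tokens refill at `rate_bps`; a packet passes
# only if enough tokens have accumulated.

class TokenBucket:
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0       # bytes added per second
        self.capacity = float(burst_bytes)
        self.tokens = float(burst_bytes)
        self.last = 0.0

    def allow(self, pkt_bytes, now):
        """True if a pkt_bytes packet may pass at time `now` (seconds)."""
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= pkt_bytes:
            self.tokens -= pkt_bytes
            return True
        return False                     # shaped: queue or drop the packet

# Shape a flow to 8 kbit/s (1000 B/s) with a 1500-byte burst:
tb = TokenBucket(rate_bps=8000, burst_bytes=1500)
print(tb.allow(1500, now=0.0))   # True: initial burst allowed
print(tb.allow(1500, now=0.1))   # False: only ~100 bytes refilled
print(tb.allow(1500, now=1.6))   # True: bucket refilled
```

The same bucket, with a drop instead of a queue, also covers the “selective drop” and “fencing” actions listed above.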
Outline
- Motivating example: Discovering and protecting network service performance
- PNEs as A-Layer building block
- Overview: Annotation layer as provider of component building block for network management
- Revisit network service example with A-Layer
- Research challenges, open issues, opportunities
PNEs (Programmable Network Elements) and iBoxes
- Inspection-and-action points
  - Deep, multiprotocol packet inspection
  - No routing, just observation and marking
  - Actions: selective drop, b/w fencing and shaping, notification of operators, query “points of observation”
- Some protocol visibility into TCP, UDP, ‘good’ network service protocols like DNS/NFS
- Per-flow session state and reverse path visibility
- Per-flow and per-path simple statistics gathering (latencies, round-trip times, requests/sec, source and destination addresses)
Annotation Layer
- Explicit layer for iBox-to-iBox communication via packet annotations
- Annotations:
  - Fixed size
  - Encoded to enable the de-annotation of packets
  - Multiple payload types based on any layer of the flow
  - Security field for authentication
[Diagram: a packet annotated with “url: X” passing through a chain of iBoxes]
A-Layer Annotation Design
One 36-byte AL unit, laid out in 32-bit words:
- AL unit headers (14 bytes): Source Address, Destination Address, Type, Prior Protocol, Sequence Number
- Authentication Field (10 bytes)
- Annotation Layer Payload (12 bytes of payload in one AL unit)
- Encode annotations in between IP and transport
- Allow annotations to be stacked (multiple)
- Annotations are removed by iBoxes before reaching endhosts
- Motivation: start with a large (but versatile) annotation format
  - When we discover the set of annotations that are most effective for network management, we can reduce the footprint to support that set
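A 36-byte AL unit (14-byte headers + 10-byte authentication field + 12-byte payload) can be packed and unpacked as below. The talk fixes only the three section sizes; the individual header field widths (4+4+1+1+4 bytes) are an assumed split for illustration, not the actual format.

```python
# Sketch of encoding one AL unit with struct. Field widths inside the
# 14-byte header are assumptions; section sizes come from the slide.

import struct

AL_FMT = "!4s4sBBI10s12s"  # src, dst, type, prior proto, seq, auth, payload
assert struct.calcsize(AL_FMT) == 36

def pack_al(src, dst, atype, prior_proto, seq, auth, payload):
    return struct.pack(AL_FMT, src, dst, atype, prior_proto, seq,
                       auth.ljust(10, b"\0"), payload.ljust(12, b"\0"))

def unpack_al(unit):
    return struct.unpack(AL_FMT, unit)

unit = pack_al(b"\xa9\xe5\x3e\x01", b"\xa9\xe5\x3c\x02",
               atype=1, prior_proto=6, seq=42,
               auth=b"mac", payload=b"dns:slow")
src, dst, atype, prior, seq, auth, payload = unpack_al(unit)
print(len(unit), seq, payload.rstrip(b"\0"))
```

The `prior_proto` field is what lets an iBox splice the annotation out and restore the original IP `protocol` chain before the packet reaches an annotation-unaware endhost.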
iBox placement
In an Enterprise Network: iBoxes at points of hierarchical division
[Diagram: enterprise network with iBoxes at the Internet edge, access edge, and server edge, joined by the distribution tier; the server edge hosts a spam appliance, primary & secondary DNS servers, and a mail server; access hosts occupy 10.0.0.1–10.0.0.100, servers 10.0.0.101–10.0.0.255]
These locations give iBoxes the ability to monitor and classify traffic flowing through them. iBoxes can also slow down, block, fence, and drop traffic to ease surges and protect “good” traffic from bad/ugly traffic.
Routing to other iBoxes
- Once we know which iBoxes exist, we need to know how to reach them so we can send them annotations
- Requires building up this table at each iBox (topology dependent)
- If a packet’s destination address doesn’t match an iBox in this table, we remove all annotations to ensure endhost correctness

IPv4/v6 Address   iBox ID
169.229.62/24     A
169.229.60/24     B
169.229/16        C
128.40.1.3/32     D
128.40.1.4/32     B
0/0               none

(Entries represent both “core” and “edge” iBoxes.)
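The table above behaves like an ordinary longest-prefix-match lookup; a sketch using the standard `ipaddress` module, with the slide’s entries:

```python
# Sketch of the per-iBox annotation routing table: longest-prefix match
# over the entries above. A result of None (the 0/0 row) means "no iBox
# on this path": strip all annotations before the packet leaves.

import ipaddress

TABLE = [
    (ipaddress.ip_network("169.229.62.0/24"), "A"),
    (ipaddress.ip_network("169.229.60.0/24"), "B"),
    (ipaddress.ip_network("169.229.0.0/16"), "C"),
    (ipaddress.ip_network("128.40.1.3/32"), "D"),
    (ipaddress.ip_network("128.40.1.4/32"), "B"),
    (ipaddress.ip_network("0.0.0.0/0"), None),   # catch-all: no iBox
]

def lookup(dst):
    """Return the iBox for `dst` by longest-prefix match (None = strip)."""
    addr = ipaddress.ip_address(dst)
    net, ibox = max(((n, i) for n, i in TABLE if addr in n),
                    key=lambda e: e[0].prefixlen)
    return ibox

print(lookup("169.229.62.7"))   # A  (/24 beats the /16)
print(lookup("169.229.1.1"))    # C
print(lookup("128.40.1.4"))     # B  (host route)
print(lookup("8.8.8.8"))        # None -> remove annotations
```

A real iBox would use a trie rather than a linear scan, but the matching semantics are the same.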
Active vs Passive annotations
- When to send “active” annotations (i.e., a separate packet) vs. when to passively annotate?
- Available during high traffic (passive) vs. expedient (active)
- Associate timers with each queue
  - When a packet arrives and an annotation is dequeued, we reset the timer
  - If the timer goes off, we generate a new dummy packet, annotate it, send it off to the right destination iBox, and reset the timer
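The timer logic above can be sketched as a per-destination queue: annotations wait to piggyback on passing traffic, and a deadline bounds how long any annotation may wait before an active dummy packet carries it. The timeout value and the `send_active` hook are illustrative assumptions.

```python
# Sketch of the active-vs-passive decision. Annotations piggyback when
# traffic passes; a timer flushes them actively when it does not.

import collections

class AnnotationQueue:
    def __init__(self, timeout, send_active):
        self.timeout = timeout          # max seconds to wait for piggybacking
        self.send_active = send_active  # emits a dummy packet (hypothetical hook)
        self.pending = collections.deque()
        self.deadline = None

    def enqueue(self, annotation, now):
        self.pending.append(annotation)
        if self.deadline is None:
            self.deadline = now + self.timeout

    def on_packet(self, now):
        """A packet toward the destination iBox passes: piggyback one."""
        if self.pending:
            ann = self.pending.popleft()
            # dequeue resets the timer (or clears it if nothing is left)
            self.deadline = now + self.timeout if self.pending else None
            return ann                  # attach to the passing packet
        return None

    def tick(self, now):
        """Timer fired: flush via an active dummy packet."""
        if self.pending and now >= self.deadline:
            self.send_active(self.pending.popleft())
            self.deadline = now + self.timeout if self.pending else None

sent = []
q = AnnotationQueue(timeout=1.0, send_active=sent.append)
q.enqueue("dns-latency-report", now=0.0)
print(q.on_packet(now=0.5))   # piggybacked passively
q.enqueue("smtp-rate-report", now=0.6)
q.tick(now=2.0)               # no traffic passed in time: sent actively
print(sent)
```

This captures the trade-off on the slide: under high traffic, piggybacking is nearly free; under silence, the deadline bounds annotation latency.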
Outline
- Motivating example: Discovering and protecting network service performance
- PNEs as A-Layer building block
- Overview: Annotation layer as provider of component building block for network management
- Revisit network service example with A-Layer
- Research challenges, open issues, opportunities
A-Layer as component building blocks for observe-analyze-act
- Observe
  - Path statistics; req/reply completion rate, latency; new connection rate; connection age; protocol types/mixtures; their change over time
- Analyze
  - Correlations; mean changing over time (chi-sq); PCA; experimental intervention (act, then observe)
- Act
  - BW throttling, selective drop, packet scheduling, BW fencing
Why distributed observe-analyze-act?
- Centralized
  - More control, consistent information (but could be out of date)
  - Centralized policy (no need to cast policy over multiple nodes)
- Distributed
  - Quick distribution of information
  - Need for information throughout the network
  - Works during network partitions; provides visibility during surges when it is hard to get packets through
  - Up-to-date info, but might be inconsistent
  - But consistency is hard; could start bad feedback loops; need to elect a leader
- Distributed routing preferred over a centralized approach: similar motivation for iBoxes/A-Layer
Outline
- Motivating example: Discovering and protecting network service performance
- PNEs as A-Layer building block
- Overview: Annotation layer as provider of component building block for network management
- Revisit network service example with A-Layer
- Research challenges, open issues, opportunities
Path-oriented connectivity and reachability
- Network service monitoring
  - Are requests getting through? What is their rate? What has been happening to the DNS latency? Where are “DNS hotspots”?
- iBoxes can store characteristics of paths through the network
  - Types of protocols they see, volume of protocols, rate of change of traffic, distribution of source/destination addresses seen, network errors, topology information
  - NetFlow-style statistics gathering at a single point; extract and share reports from this information
- Annotate packets with an iBox source annotation to have access to inside-vs-outside / paths chosen and paths taken
- Annotate packets with service reachability reports, link conditions, traffic rates and changes of traffic rates
- Annotate packets with protocol reports that represent the mixture of protocols seen at various points throughout the network
Relationship between traffic classes, correlations, anomalies
- Discovering anomalies: iBoxes consuming annotations from other parts of the network need to be able to discover when good services lose performance
  - SLT problem of anomaly detection made easier with more information and visibility
  - Network data stored in vector form for rate, quantity, time domain
- Discovering correlations: For good services that are degrading, finding correlations to anomalous traffic surges, flash traffic, etc. provides hints to the cause of the problem
  - Each iBox representing affected traffic needs annotations containing network-wide events capturing changes in traffic patterns
- “Analysis” components of observe-analyze-act: done from multiple network vantage points, or centralized?
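The simplest anomaly check mentioned above, deviation from the mean, can be sketched as a rolling z-score over recent per-interval samples. Window size, threshold, and the toy request rates are illustrative.

```python
# Sketch: flag an interval whose request rate strays more than k
# standard deviations from a rolling baseline (deviation-from-mean).

import collections
import statistics

class MeanShiftDetector:
    def __init__(self, window=8, k=3.0):
        self.history = collections.deque(maxlen=window)
        self.k = k

    def observe(self, value):
        """Return True if `value` is anomalous w.r.t. the recent window."""
        anomalous = False
        if len(self.history) >= 4:      # need a minimal baseline first
            mean = statistics.mean(self.history)
            sd = statistics.pstdev(self.history) or 1e-9
            anomalous = abs(value - mean) > self.k * sd
        self.history.append(value)
        return anomalous

det = MeanShiftDetector()
rates = [100, 103, 98, 101, 99, 102, 480]   # DNS requests/sec (toy data)
flags = [det.observe(r) for r in rates]
print(flags)   # only the 480 req/s surge is flagged
```

Annotations carrying such per-interval vectors are what let a remote iBox run this check for paths it cannot see directly.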
Experimental intervention, protection of good traffic via policy actions
- Experimental intervention:
  - Control annotations sent to the iBox near the source of the surge to temporarily throttle
  - Annotations routed to the iBox at the ISP ingress to invoke a new policy
  - The policy in the annotation relies on iBox actions of BW shaping, fencing, and TCP ACK manipulation to reduce the SMTP flow rate
- Protection of good traffic:
  - Policy could include network-level redirection to channel good DNS requests from access networks to a secondary, backup DNS service
  - Marking traffic not affiliated with the surge for protection elsewhere in the network, closer to the service location
Outline
- Motivating example: Discovering and protecting network service performance
- PNEs as A-Layer building block
- Overview: Annotation layer as provider of component building block for network management
- Revisit network service example with A-Layer
- Research challenges, open issues, opportunities
Policy expression and deployment
- When correlations are discovered, what to do with them?
- Initial efforts are to provide an observation platform for visualization of network state
- A-Layer/iBoxes as building blocks for operator interaction
“Above the network” services
- Right now we envision iBoxes understanding well-known network services
- Open question as to visibility into higher-level applications like web services, enterprise-specific apps
- New policy complexity, new correlations and state management needed
Statistical visualization for operators
- Open problem to aggregate distributed observations into a coherent visualization for operators
- Where does the visualization reside?
- What are the right metrics/correlations/deviations from the mean that are relevant?
- How do actions relate to visualization?
SLT analysis
- Choice of algorithm
- Finding “interesting” correlations
- Not being overloaded with too many correlations and events
- Deviation from the mean, finding patterns, what is normal operation for a protocol?
Managing distributed actions
- Managing feedback loops
- Providing coherent actions at the global scale based on iBoxes distributed throughout the network
- Coordinating actions despite network surges and limited network access, path losses, etc.
Q: What about the e2e argument?
- Adding/removing annotations:
  - Annotations are easy to remove
  - Packet paths are not modified
- Actions such as throttling, scheduling, dropping:
  - Con: affects traffic in ways endhosts can detect
  - Pro: provides a “library” of components to enable new network services / management features; that’s how we build software
- A-Layer gives enterprise operators control over their networks
  - As long as their applications are supported and work
  - Enterprise networks usually have a white list of allowed apps, all others disallowed; contrast this to ISPs
Q: What about per-flow state management?
- Some routers can keep per-flow state (NetFlow)
- iBoxes can sample traffic
- iBoxes are not in the correctness path; they can act as ‘nops’
- Network traffic is parallelizable; targeting 1 GigE
- Can be merged into expandable network devices (see Cisco’s server cards that plug into routers)
Q: What about e2e security (IPsec)?
- E2e security obscures protocol contents, but not path stats
- Conceivable to discover request/response phases and infer completion rate; keep stats on # connections, flow rates
- Statistically infer when a flow is starved for bandwidth; observe bandwidth over time; correlate with destination/source function (web server, mail server, etc.)
- Correlations still work over encrypted traffic
- Can still perform experiments by affecting flow X, observing flow Y
Q: Why annotate? (Why not send separate packets?)
- Annotations are about path characteristics
  - Can bind to the flow they describe
  - Statistics follow the paths where they are most relevant
  - Marries per-path context with each packet of a particular flow (gives iBoxes the info they need to throttle, fence, etc.)
- As packet flow rate increases, more opportunity for visibility by piggybacking
  - Lower overhead during times of stress
  - Possible preference for fewer large packets over more small packets
- Explicit sending of separate packets is still OK, especially for discovery, control, and policy distribution
Q: Why distributed?
- Centralized statistics gathering is easy in enterprise networks
  - But hard during times of stress / traffic spikes / flash traffic
- Information might be needed in more than one place
  - “Act” operations to protect good traffic need timely info; contrast to the 5-minute averages common in SNMP
- Raises difficulty, though
  - Election protocols, distributed consensus, negative feedback loops, management of iBoxes
- Let’s experiment and see
  - Open research question as to the benefit of distributed vs. centralized network observation, analysis, and action/actuation