Network traffic monitoring - The University of Edinburgh
Lecture outline
• What is network traffic management?
• Traffic management applications
• Traffic monitoring system design considerations
• Overview of traffic monitoring technologies
  – SNMP polling
  – RMON
  – Cisco NetFlow
  – sFlow
• Real world examples
• Summary
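Control theory applied to network management
[Diagram: feedback control loop]
• Reference: thresholds, usage policies, service policies
• Sensor: network monitoring (SNMP polling, traffic monitoring) reports the measured performance
• Measured error: difference (+/-) between the reference and the measured performance
• Controller: turns the measured error into controls applied to the network
• Network: current network state (utilization, users, applications) determines system performance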
What is network traffic management?
• Understanding the use of the network
• Understanding the requirements of users
• Measuring how well user requirements are met
• Making changes to improve the quality of service experienced by users
• Monitoring the effectiveness of the changes
• Monitoring network traffic is an effective way to measure demand and usage
Traffic management applications
• Detecting and resolving congestion
• Identifying and correcting performance problems
• Identifying and mitigating security breaches
• Planning for future growth and new applications
• Billing for usage
Traffic monitoring system design considerations
• Accurate
  – Quantitative traffic measurements
  – Measure all types of traffic
  – Data on how traffic is routed
• Timely
  – Up-to-date view of entire network under all traffic loads
  – Monitor all of the time
• Scalable
  – Monitor all the devices in the network
  – Monitor all speeds of links (1G, 10G, 40G…)
• Minimal impact on performance
  – Switch or router CPU utilisation
  – Network overhead
• Low cost embedded implementation
  – Encourage pervasive deployment
  – Focus complexity in the central collector
A traffic monitoring system includes measurement, data collection and analysis
SNMP polling
• RFC 1213 ifTable (SNMP OID 1.3.6.1.2.1.2.2) defines counters recording the total volume of traffic carried by each interface:
  – ifInOctets, ifInUcastPkts, ifInMulticastPkts, ifInBroadcastPkts, ifInDiscards, ifInErrors, ifInUnknownProtos, ifOutOctets, ifOutUcastPkts, ifOutMulticastPkts, ifOutBroadcastPkts, ifOutDiscards, ifOutErrors
• Commonly polled using SNMP GET every 5 minutes
  – Delta between consecutive values gives the value for the 5 minute interval
  – Delta values stored by management entity
  – Sequence of values used to present trend (eg over day, week, month)
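The delta calculation above can be sketched in a few lines. This is a minimal illustration, not how any particular polling tool works: the function names and the sample counter values are made up, and it assumes 32-bit counters that wrap at most once between polls.

```python
COUNTER32_MODULUS = 2 ** 32  # assumption: the polled counter is a 32-bit SNMP counter


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Octets counted between two polls, allowing for a single counter wrap."""
    return (current - previous) % modulus


def average_bits_per_second(previous, current, interval_seconds=300):
    """Average rate over one polling interval (300 s = the 5 minute poll)."""
    return counter_delta(previous, current) * 8 / interval_seconds


# Hypothetical ifInOctets readings from three consecutive polls;
# the third reading is smaller because the counter wrapped past 2**32.
polls = [4_294_000_000, 4_294_900_000, 1_232_704]
deltas = [counter_delta(a, b) for a, b in zip(polls, polls[1:])]
rate = average_bits_per_second(polls[0], polls[1])
print(deltas, rate)
```

Because only the average over the interval survives, a short burst inside the 5 minutes is invisible in the stored delta.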
SNMP polling in practice
• Good for understanding overall usage
• ifTable widely supported by network devices
• Does not give any insight into who is using the network or why
• Measurements are quite coarse and brief spikes will be missed
• Scalability limitations:
  – CPU intensive for devices, especially those with large numbers of interfaces
  – Polling application (eg Cacti, MRTG) can be CPU intensive, limiting the number of devices that can be monitored by a single system
  – Incurs high network load when polling a large number of devices
RMON – Remote Network MONitoring Management Information Base (RFC 1757)
• Developed by the IETF during the early 1990s to standardise network monitoring probes
• Assumed that a single probe would see all the traffic in the network
• Standard defines:
  – 20 types (groups) of measurements made and stored by a probe
  – MIB used to access the data via SNMP polling
RMON implementation in switches
• Onset of switching in the mid 1990s dramatically increased the number of probes required to monitor traffic (one for each switch port!)
• Switch vendors pressured to provide RMON functionality in switches
• Most useful RMON functions (eg matrix, hostTopN) require significant resources and were not implemented by switch vendors
• Switch vendors commonly implement 4 groups (1 statistics, 2 history, 3 alarm, 9 event), providing very limited capability
Cisco NetFlow
• Originally designed as a way to manage the size of the flow cache used to optimise routing decisions
• Flow cache accumulated on the router for routed traffic
• Flow cache can be exported over UDP (push/event based)
  – Expired flows (TCP FIN flag)
  – At regular intervals (typically flow cache timeout >= 5 mins)
  – When the cache is full
• Example here is NetFlow v5 – the most common implementation
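The three export triggers listed above (flow ended, timeout, cache full) can be illustrated with a toy flow cache. This is a sketch under stated assumptions, not Cisco's implementation: the class and method names are hypothetical, the flow key is a simple 5-tuple chosen for illustration, and evicting the least recently updated flow when the cache is full is an assumed policy.

```python
from dataclasses import dataclass


@dataclass
class FlowRecord:
    first_seen: float
    last_seen: float
    packets: int = 0
    octets: int = 0


class FlowCache:
    """Toy flow cache that 'exports' records by appending them to a list."""

    def __init__(self, max_entries=1000, active_timeout=300.0):
        self.max_entries = max_entries
        self.active_timeout = active_timeout  # seconds, eg the 5 minute timeout
        self.flows = {}
        self.exported = []

    def _export(self, key):
        self.exported.append((key, self.flows.pop(key)))

    def observe(self, key, length, now, tcp_fin=False):
        if key not in self.flows:
            if len(self.flows) >= self.max_entries:
                # Cache full: export the least recently updated flow (assumed policy).
                oldest = min(self.flows, key=lambda k: self.flows[k].last_seen)
                self._export(oldest)
            self.flows[key] = FlowRecord(first_seen=now, last_seen=now)
        record = self.flows[key]
        record.packets += 1
        record.octets += length
        record.last_seen = now
        if tcp_fin:
            self._export(key)  # flow has ended

    def expire(self, now):
        """Export every flow that has been in the cache for the full timeout."""
        stale = [k for k, r in self.flows.items()
                 if now - r.first_seen >= self.active_timeout]
        for key in stale:
            self._export(key)


cache = FlowCache(max_entries=2, active_timeout=300.0)
web = ("10.0.0.1", "10.0.0.2", 40000, 80, 6)    # src, dst, sport, dport, protocol
dns = ("10.0.0.1", "10.0.0.3", 40001, 53, 17)
cache.observe(web, 1500, now=0.0)
cache.observe(web, 40, now=1.0, tcp_fin=True)   # exported on FIN
cache.observe(dns, 80, now=2.0)
cache.expire(now=400.0)                         # exported on timeout
```

Note that the `dns` flow is not visible to the collector until the timeout fires, which is the source of the export delay discussed for NetFlow v5.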
NetFlow v5 in practice
• Can provide accurate data on TCP/IP flows
• Good for monitoring WAN traffic
• Provides TCP/IP v4 data for routed flows only
  – Does not monitor L2 traffic or traffic that is switched
  – Does not monitor IPv6 traffic
• Exported data can be delayed (flow cache timeout)
• Scalability limitations
  – CPU and memory intensive, especially with large numbers of connections and high speed links (eg 10G)
  – Exported data is bursty and impacts network performance
  – Not robust under difficult conditions (eg denial of service)
    • Router runs out of memory and cannot export data quickly enough
    • Export uses UDP, so exported data may be dropped by the network
    • Accuracy is affected but the error cannot be quantified
• Often requires additional hardware (feature card, memory)
NetFlow variants
• Juniper cflow and J-Flow
  – Addresses some scalability issues by using sampled packets to update the flow cache
• Huawei NetStream
• Cisco NetFlow v9 aka Flexible NetFlow
  – Addresses some issues with NetFlow v5 by including different fields in the flow cache (MAC addresses, IPv6)
• Internet Protocol Flow Information Export (IPFIX)
  – IETF standard derived from NetFlow v9
Internet Protocol Flow Information Export (IPFIX)
• Defines the protocol for information export
• Template describing flow cache "keys" defined on device allows more flexible measurements than NetFlow v5
  – eg source MAC, destination MAC, ethertype
  – Template exported periodically in separate control channel
  – Management entity listens for templates and uses them to interpret data
  – Each vendor must define the templates supported
• Defines sampling mechanisms to improve scalability
• IPFIX compliant devices must be able to export data over Stream Control Transmission Protocol (SCTP)
  – Addresses reliability issues
  – Increases implementation complexity and cost
• IPFIX compliant devices must be able to encrypt exported data
  – Addresses concerns with data privacy
  – Increases implementation cost
• Not yet widely supported by router or network management vendors
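The role of templates can be shown with a toy collector that cannot interpret a data record until the matching template has arrived. This is a deliberately simplified sketch and not the IPFIX wire format: the class, the field names and the use of 256 as a template identifier are illustrative assumptions.

```python
class TemplateCollector:
    """Toy collector: remembers templates, then uses them to split data records."""

    def __init__(self):
        self.templates = {}  # template_id -> list of (field_name, length_in_bytes)

    def on_template(self, template_id, fields):
        self.templates[template_id] = list(fields)

    def on_data(self, template_id, payload):
        fields = self.templates.get(template_id)
        if fields is None:
            return None  # data cannot be interpreted until its template arrives
        expected = sum(length for _, length in fields)
        if len(payload) != expected:
            raise ValueError(f"expected {expected} bytes, got {len(payload)}")
        record, offset = {}, 0
        for name, length in fields:
            record[name] = payload[offset:offset + length]
            offset += length
        return record


collector = TemplateCollector()
payload = bytes.fromhex("001122334455" "66778899aabb" "0800")

early = collector.on_data(256, payload)  # template not seen yet
collector.on_template(256, [("source_mac", 6),
                            ("destination_mac", 6),
                            ("ethertype", 2)])
record = collector.on_data(256, payload)
print(early, record)
```

The same payload bytes are meaningless before the template is known and fully decodable afterwards, which is why templates are re-sent periodically.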
sFlow
• Standard maintained by an industry standards body, sFlow.org
• Defines measurements and data export
• Implemented by most switch vendors and supported by many network management application vendors
sFlow architecture
[Diagram: simple agents embedded in switches and routers send sFlow to a central sFlow collector]
Smart Collector
• Collects sFlow from all network devices
• Scales to monitor the entire network
• Performs complex analysis
• Alerts on abnormal traffic
Simple Agents
• 1 in N sampling of packets
• Time-based counter sampling
• Easy to implement
• Embedded, wire-speed
• Numerous (every device, every port)
Coverage: all switches/routers, all interfaces, all protocols, all of the time
sFlow sampling algorithms
[Flowchart: packet sampling process]
1. Initialise: Total_Packets = 0, Total_Samples = 0, Skip = NextSkip(Rate)
2. Wait for Packet
3. Exclude Packet? If yes, return to step 2
4. Assign Destination Interface
5. Decrement Skip, Increment Total_Packets
6. Skip = 0? If no, go to step 8
7. Skip = NextSkip(Rate), Increment Total_Samples, then Send to Agent: Copy of Sampled Packet, Source Interface, Destination Interface, Total_Samples, Total_Packets
8. Send Packet to Destination Interface, then return to step 2
[Diagram: sFlow Agent]
• Packet sampling process produces packet samples (including the forwarding decision associated with the sampled packet)
• Interface counter samples (time-based sampling – eg every 20s)
• Both types of sample are sent in the sFlow Datagram
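The skip-counter logic in the sampling flowchart can be sketched as below. This is an illustration under assumptions rather than any vendor's agent: the class and method names are hypothetical, NextSkip is assumed to draw uniformly from 1 to 2*Rate-1 (so the mean skip equals Rate), and truncating the sampled packet to 128 bytes is an arbitrary choice.

```python
import random


class PacketSampler:
    """Toy version of the Skip / Total_Packets / Total_Samples sampling loop."""

    def __init__(self, rate, rng=None, header_bytes=128):
        self.rate = rate
        self.rng = rng or random.Random()
        self.header_bytes = header_bytes  # assumption: export only leading bytes
        self.total_packets = 0
        self.total_samples = 0
        self.skip = self.next_skip()

    def next_skip(self):
        # Assumed NextSkip(Rate): random skip with mean equal to the sampling rate.
        return self.rng.randint(1, 2 * self.rate - 1)

    def on_packet(self, packet, source_if=0, destination_if=0, excluded=False):
        """Return a sample for the agent, or None if this packet is not sampled."""
        if excluded:
            return None
        self.skip -= 1
        self.total_packets += 1
        if self.skip > 0:
            return None
        self.skip = self.next_skip()
        self.total_samples += 1
        return {
            "header": packet[:self.header_bytes],
            "source_if": source_if,
            "destination_if": destination_if,
            "total_samples": self.total_samples,
            "total_packets": self.total_packets,
        }


sampler = PacketSampler(rate=100, rng=random.Random(1))
samples = []
for _ in range(100_000):
    sample = sampler.on_packet(b"\x00" * 64, source_if=1, destination_if=2)
    if sample is not None:
        samples.append(sample)
print(sampler.total_packets, sampler.total_samples)
```

The per-packet work is one decrement, one increment and one comparison, which is why this style of sampling is cheap enough to run at wire speed.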
sFlow exports packet headers
• Don't expect layer 2 devices to decode the data
• Much easier to add decodes to the central collector than to every device in a multi-vendor network (e.g. IPv6, FCoE etc.)
• Packet header captures the complex layering (MAC, VLAN, MPLS, IPv4, IPv6) that is critical for tracing packet paths through the network
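Decoding at the collector can be as simple as walking the layers of the exported header bytes. The sketch below handles only the Ethernet header and stacked 802.1Q VLAN tags; the function name and the small EtherType table are illustrative, and a real collector would decode many more layers.

```python
import struct

# A few well-known EtherType values; supporting a new protocol is one more entry here.
ETHERTYPES = {
    0x0800: "IPv4",
    0x86DD: "IPv6",
    0x8847: "MPLS unicast",
    0x8906: "FCoE",
}
VLAN_TAG = 0x8100  # 802.1Q tag protocol identifier


def decode_layers(header):
    """Decode MAC addresses, VLAN tags and payload type from raw header bytes."""
    if len(header) < 14:
        raise ValueError("need at least a 14-byte Ethernet header")
    dst, src, ethertype = struct.unpack("!6s6sH", header[:14])
    offset = 14
    vlans = []
    while ethertype == VLAN_TAG and len(header) >= offset + 4:
        tci, ethertype = struct.unpack("!HH", header[offset:offset + 4])
        vlans.append(tci & 0x0FFF)  # low 12 bits of the tag are the VLAN ID
        offset += 4
    return {
        "dst_mac": dst.hex(":"),
        "src_mac": src.hex(":"),
        "vlans": vlans,
        "payload": ETHERTYPES.get(ethertype, hex(ethertype)),
    }


# Hypothetical sampled header: broadcast frame on VLAN 100 carrying IPv6.
header = bytes.fromhex("ffffffffffff" "001122334455" "8100" "0064" "86dd")
decoded = decode_layers(header)
print(decoded)
```

Because the switch exports raw bytes, none of this logic needs to exist in the switch itself.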
sFlow replaces counter polling
• sFlow agent automatically pushes full set of SNMP ifTable counters
• Compared to SNMP polling, counter push results in 10-20x fewer packets on the network and reduces CPU load on the switch and on network management software
  – XDR* is easier to encode/decode than the ASN.1 used by SNMP
  – Counter push is not synchronised between devices
• A single sFlow collector can easily monitor 200,000 switch ports with 1 minute granularity. SNMP polling with 5 minute granularity requires 5-10 collectors.
*XDR (RFC 1832) is a standard for describing and encoding data transferred between systems with different architectures
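To give a feel for why XDR is simple to handle, the sketch below encodes a few counters as fixed-size big-endian fields, which is how XDR represents unsigned integers (4 bytes) and unsigned hypers (8 bytes). The three-field layout and the function names are hypothetical and are not the actual sFlow counter record.

```python
import struct

# XDR unsigned int = 4 bytes big-endian, unsigned hyper = 8 bytes big-endian.
# Hypothetical record: interface index, input octets (64-bit), input unicast packets.
XDR_LAYOUT = "!IQI"


def encode_counters(if_index, in_octets, in_ucast_pkts):
    """Pack the counters into a fixed-size XDR-style byte string."""
    return struct.pack(XDR_LAYOUT, if_index, in_octets, in_ucast_pkts)


def decode_counters(data):
    """Unpack the byte string back into (if_index, in_octets, in_ucast_pkts)."""
    return struct.unpack(XDR_LAYOUT, data)


encoded = encode_counters(3, 2 ** 33, 7)
decoded_counters = decode_counters(encoded)
print(len(encoded), decoded_counters)
```

Every field sits at a fixed offset with a fixed width, so there are no type tags or variable-length encodings to parse, unlike the tag-length-value encodings used with ASN.1.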
Two types of measurement that are scalable with known accuracy
• Periodic sampling of counters
  – Counting is fast, hardware supports counting, most systems count events, transactions, errors etc.
• Statistical sampling of packets
  – A variant on packet counting: count down to zero, capture the packet, reset the counter with a new random number
• Why are these mechanisms scalable?
  1. They require minimal, fixed size state (just a block of counters per node). Total state space grows linearly with number of nodes.
  2. Very few operations required, easy to implement in hardware, very small impact when implemented in software
  3. Asynchronous, easily implemented without synchronization or locking mechanisms on multi-port, multi-module, multi-thread, multi-core devices etc
• Accuracy
  1. Not 100% accurate but sufficiently accurate for many applications including billing
  2. Sampling accuracy determined by number of samples, not total population (http://blog.sflow.com/2009/05/scalability-and-accuracy-of-packet.html)
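The claim that accuracy depends on the number of samples can be made concrete with the approximation commonly quoted in the sFlow literature: at 95% confidence, the percentage error of an estimate is at most about 196 * sqrt(1/c), where c is the number of samples seen in the traffic class of interest. The sketch below assumes that approximation holds (it is intended for classes that are a small fraction of total traffic); the function names are illustrative.

```python
import math


def estimated_total(samples_in_class, sampling_rate):
    """Scale the sample count up to an estimate of the real packet count."""
    return samples_in_class * sampling_rate


def percent_error_95(samples_in_class):
    """Approximate 95% confidence bound on relative error, in percent."""
    return 196 * math.sqrt(1 / samples_in_class)


for c in (100, 1_000, 10_000):
    print(c, estimated_total(c, 1000), round(percent_error_95(c), 2))
```

Neither the link speed nor the total number of packets appears in the error term, so the same number of samples gives the same accuracy on a 1G or a 40G link.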
Real world example:
• Outage on 24th September 2009 caused by high load on Contacts Service
  – Network issue in the data centre
  – Unusually high load on the Contacts Service
  – Update to Gmail which also placed a high load on Contacts Service
• Illustrates complex dependencies between networked components
• Monitoring traffic would have identified:
  – Network issue in the data centre
  – Abnormal connection rate to Contacts Service
• Monitoring enables rapid identification of issues so that mitigating action can be taken promptly
Real world example: CERN
• Large Hadron Collider
  – High speed switched network used to collect measurements from the experiment and control the experiment
  – Sophisticated monitoring of the network is essential for successful operation of the experiments
  – CERN uses sFlow because of its scalability
"Because there are so many ports in the core switches, the SNMP query of interface counters takes a long time and occupies a lot CPU and memory resource."
Real world example: CERN Investigation of Network Behaviour and Anomaly Detection (CINBAD)
"Even in CERN 'academic' environment, we can not afford network downtimes, especially when LHC starts to produce peta bytes of data."
"CERN's campus network has more than 50,000 active user devices interconnected by 10,000 km of cables and fibres, with more than 2500 switches and routers. The potential 4.8 Tbps throughput within the network core and 140 Gbps connectivity to external networks offers countless possibilities to different network applications."
"To acquire knowledge about the network status and behaviour, CINBAD collects and analyses data from numerous sources. A naive approach might be to look at all of the packets flying over the CERN network. However, if we did this we would need to analyse even more data than the LHC could generate. The LHC data are only a subset of the total data crossing via these links."
"CINBAD overcomes this issue by applying statistical analysis and using sFlow, a technology for monitoring high-speed switched networks that provides randomly sampled packets from the network traffic."
Summary
• Network traffic monitoring and management manages the quality of service provided by the network
• Critical for the operation of modern networks
• Various technologies with different approaches to addressing the key design focus of scalability
References
• RMON
  – http://www.rfc-editor.org/rfc/rfc1757.txt
• NetFlow
  – http://www.cisco.com/en/US/tech/tk812/tsd_technology_support_protocol_home.html
• IPFIX
  – http://www.ietf.org/rfc/rfc5101.txt
  – http://datatracker.ietf.org/wg/ipfix/charter/
• sFlow
  – http://www.sflow.com
  – http://blog.sflow.com
• XDR
  – http://tools.ietf.org/rfc/rfc1832.txt
• CERN
  – http://cdsweb.cern.ch/record/1216160/files/LHCb-CONF-2009-047.pdf
  – http://cerncourier.com/cws/article/cnl/40379