Network traffic monitoring - The University of Edinburgh · What is network traffic management? •...

Network traffic monitoring and management

Sonia [email protected]

11th November 2010

Lecture outline

• What is network traffic management?
• Traffic management applications
• Traffic monitoring system design considerations
• Overview of traffic monitoring technologies
  – SNMP polling
  – RMON
  – Cisco NetFlow
  – sFlow
• Real world examples
• Summary

Control theory applied to network management

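[Figure: feedback control loop. The reference (thresholds, usage policies, service policies) is compared (+/-) with the measured performance to give the measured error. The controller turns the measured error into controls applied to the network, which produces the system performance. A sensor (network monitoring, SNMP polling, traffic monitoring) observes the current network state (utilization, users, applications) and feeds the measured performance back to the comparison.]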

What is network traffic management?

• Understanding the use of the network
• Understanding the requirements of users
• Measuring how well user requirements are met
• Making changes to improve the quality of service experienced by users
• Monitoring the effectiveness of the changes
• Monitoring network traffic is an effective way to measure demand and usage

Traffic management applications

• Detecting and resolving congestion
• Identifying and correcting performance problems
• Identifying and mitigating security breaches
• Planning for future growth and new applications
• Billing for usage

Traffic monitoring system design considerations

Accurate
  – Quantitative traffic measurements
  – Measure all types of traffic
  – Data on how traffic is routed

Timely
  – Up-to-date view of entire network under all traffic loads
  – Monitor all of the time

Scalable
  – Monitor all the devices in the network
  – Monitor all speeds of links (1G, 10G, 40G…)

Minimal impact on performance
  – Switch or router CPU utilisation
  – Network overhead

Low cost embedded implementation
  – Encourage pervasive deployment
  – Focus complexity in the central collector

A traffic monitoring system includes measurement, data collection and analysis

SNMP polling

• RFC 1213 ifTable (SNMP OID 1.3.6.1.2.1.2.2) defines counters recording the total volume of traffic carried by each interface:
  – ifInOctets, ifInUcastPkts, ifInMulticastPkts, ifInBroadcastPkts, ifInDiscards, ifInErrors, ifInUnknownProtos, ifOutOctets, ifOutUcastPkts, ifOutMulticastPkts, ifOutBroadcastPkts, ifOutDiscards, ifOutErrors
  – (the multicast and broadcast packet counters come from the later IF-MIB ifXTable, RFC 2863; the original RFC 1213 table has the combined ifInNUcastPkts and ifOutNUcastPkts instead)
• Commonly polled using SNMP GET every 5 minutes
  – Delta between consecutive values gives the value for the 5 minute interval
  – Delta values stored by management entity
  – Sequence of values used to present trend (e.g. over day, week, month)
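
The delta calculation can be sketched in Python as follows. This is a minimal illustration, not any particular poller's code: the function names are hypothetical, and it assumes 32-bit counters that wrap to zero at most once between polls (more than one wrap in an interval cannot be detected from two readings).

```python
COUNTER32_MODULUS = 2 ** 32  # ifInOctets and ifOutOctets are 32-bit counters that wrap to zero


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Increase between two consecutive readings, allowing for a single wrap."""
    return (current - previous) % modulus


def average_bits_per_second(previous, current, interval_seconds=300):
    """Average rate over the polling interval (300 s = the 5 minute poll)."""
    octets = counter_delta(previous, current)
    return octets * 8 / interval_seconds
```

Because only the average over the interval is recovered, a burst lasting a few seconds is flattened into the 5 minute value.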

SNMP polling in practice

• Good for understanding overall usage
• ifTable widely supported by network devices
• Does not give any insight into who is using the network or why
• Measurements are quite coarse and brief spikes will be missed
• Scalability limitations:
  – CPU intensive for devices, especially those with large numbers of interfaces
  – Polling application (e.g. Cacti, MRTG) can be CPU intensive, limiting the number of devices that can be monitored by a single system
  – Incurs high network load when polling a large number of devices

RMON – Remote Network MONitoring Management Information Base (RFC 1757)

• Developed by the IETF during the early 1990s to standardise network monitoring probes
• Assumed that a single probe would see all the traffic in the network
• Standard defines:
  – 20 types (groups) of measurements made and stored by a probe
  – MIB used to access the data via SNMP polling

RMON implementation in switches

• Onset of switching in the mid 1990s dramatically increased the number of probes required to monitor traffic (one for each switch port!)
• Switch vendors pressured to provide RMON functionality in switches
• Most useful RMON functions (e.g. matrix, hostTopN) require significant resources and were not implemented by switch vendors
• Switch vendors commonly implement 4 groups (1, 2, 3, 9), providing very limited capability

Cisco NetFlow

• Originally designed as a way to manage the size of the flow cache used to optimise routing decisions
• Flow cache accumulated on the router for routed traffic
• Flow cache can be exported over UDP (push/event based)
  – Expired flows (TCP FIN flag)
  – At regular intervals (typically flow cache timeout >= 5 mins)
  – When the cache is full
• Example here is NetFlow v5 – the most common implementation
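
The three export triggers can be illustrated with a small Python sketch. It is a simplified model, not Cisco's implementation: the class and field names are hypothetical, the flow key is whatever tuple the caller supplies, the 300 second timeout simply mirrors the 5 minute figure above, and a list stands in for the UDP export to a collector.

```python
from dataclasses import dataclass

ACTIVE_TIMEOUT = 300  # seconds; assumed value mirroring the 5 minute timeout on the slide
TCP_FIN = 0x01        # FIN bit in the TCP flags byte


@dataclass
class FlowRecord:
    first_seen: float
    last_seen: float
    packets: int = 0
    octets: int = 0


class FlowCache:
    def __init__(self, max_entries=1000):
        self.max_entries = max_entries
        self.flows = {}     # flow key -> FlowRecord
        self.exported = []  # stands in for records pushed to a collector over UDP

    def _export(self, key):
        self.exported.append((key, self.flows.pop(key)))

    def update(self, key, length, now, tcp_flags=0):
        # Trigger 3: cache full, so export the least recently updated flow to make room.
        if key not in self.flows and 0 < self.max_entries <= len(self.flows):
            oldest = min(self.flows, key=lambda k: self.flows[k].last_seen)
            self._export(oldest)
        record = self.flows.setdefault(key, FlowRecord(now, now))
        record.last_seen = now
        record.packets += 1
        record.octets += length
        # Trigger 1: a TCP FIN marks the flow as finished.
        if tcp_flags & TCP_FIN:
            self._export(key)

    def expire(self, now):
        # Trigger 2: export flows that have been in the cache for the timeout period.
        old = [k for k, r in self.flows.items() if now - r.first_seen >= ACTIVE_TIMEOUT]
        for key in old:
            self._export(key)
```

The sketch also shows where the delay comes from: a long-lived flow with no FIN is invisible to the collector until `expire` runs.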

NetFlow v5 in practice

• Can provide accurate data on TCP/IP flows
• Good for monitoring WAN traffic
• Provides TCP/IP v4 data for routed flows only
  – Does not monitor L2 traffic or traffic that is switched
  – Does not monitor IPv6 traffic
• Exported data can be delayed (flow cache timeout)
• Scalability limitations
  – CPU and memory intensive, especially with large numbers of connections and high speed links (e.g. 10G)
  – Exported data is bursty and impacts network performance
  – Not robust under difficult conditions (e.g. denial of service)
    • Router runs out of memory and cannot export data quickly enough
    • Export uses UDP, so exported data may be dropped by the network
    • Accuracy affected but error cannot be quantified
• Often requires additional hardware (feature card, memory)

NetFlow variants

• Juniper cflow and J-Flow
  – Addresses some scalability issues by using sampled packets to update the flow cache
• Huawei NetStream
• Cisco NetFlow v9 aka Flexible NetFlow
  – Addresses some issues with NetFlow v5 by including different fields in the flow cache (MAC addresses, IPv6)
• Internet Protocol Flow Information Export (IPFIX)
  – IETF standard derived from NetFlow v9

Internet Protocol Flow Information Export (IPFIX)

• Defines the protocol for information export
• Template describing flow cache “keys” defined on device allows more flexible measurements than NetFlow v5
  – e.g. source MAC, destination MAC, ethertype
  – Template exported periodically in separate control channel
  – Management entity listens for templates and uses them to interpret data
  – Each vendor must define the templates supported
• Defines sampling mechanisms to improve scalability
• IPFIX compliant devices must be able to export data over Stream Control Transmission Protocol (SCTP)
  – Addresses reliability issues
  – Increases implementation complexity and cost
• IPFIX compliant devices must be able to encrypt exported data
  – Addresses concerns with data privacy
  – Increases implementation cost
• Not yet widely supported by router or network management vendors
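
The idea of a template that tells the management entity how to interpret data records can be sketched in Python. This is only an illustration of the principle: the template layout, field names and `decode_record` helper are hypothetical and do not follow the actual IPFIX message format or its registered field identifiers.

```python
# Hypothetical template as a device might announce it: (field name, length in bytes).
TEMPLATE = [
    ("source_mac", 6),
    ("destination_mac", 6),
    ("ethertype", 2),
    ("octets", 4),
]


def decode_record(template, data):
    """Split a data record into named big-endian integer fields using the template."""
    record = {}
    offset = 0
    for name, length in template:
        record[name] = int.from_bytes(data[offset:offset + length], "big")
        offset += length
    if offset != len(data):
        raise ValueError("record length does not match template")
    return record
```

Without the template the 18 bytes of a record are opaque, which is why a collector has to wait for the periodic template export before it can decode anything.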

sFlow

• Standard maintained by an industry standards body, sFlow.org
• Defines measurements and data export
• Implemented by most switch vendors and supported by many network management application vendors

sFlow architecture

Smart Collector (sFlow collector)
• Collects sFlow from all network devices
• Scales to monitor the entire network
• Performs complex analysis
• Alerts on abnormal traffic

Internet: all switches/routers, all interfaces, all protocols, all of the time

Simple Agents
• 1 in N sampling of packets
• Time-based counter sampling
• Easy to implement
• Embedded, wire-speed
• Numerous (every device, every port)

sFlow sampling algorithms

[Flowchart: packet sampling]

1. Initialise: Total_Packets = 0, Total_Samples = 0, Skip = NextSkip(Rate)
2. Wait for Packet
3. Exclude Packet? If yes, return to step 2
4. Assign Destination Interface
5. Decrement Skip, Increment Total_Packets
6. Skip = 0? If no, go to step 8
7. Skip = NextSkip(Rate), Increment Total_Samples, then Send to Agent: Copy of Sampled Packet, Source Interface, Destination Interface, Total_Samples, Total_Packets
8. Send Packet to Destination Interface, then return to step 2
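
The skip-counter logic in the flowchart above can be sketched in Python. This is a minimal sketch under stated assumptions: the class name is hypothetical, packet exclusion and forwarding are left out, and `NextSkip(Rate)` is assumed to draw uniformly from 1 to 2*Rate-1 so that on average 1 in Rate packets is sampled (the flowchart does not say which distribution is used).

```python
import random


class PacketSampler:
    def __init__(self, rate, rng=None):
        self.rate = rate
        self.rng = rng or random.Random()
        self.total_packets = 0
        self.total_samples = 0
        self.skip = self._next_skip()

    def _next_skip(self):
        # Assumed NextSkip(Rate): uniform on 1..2*rate-1, so the mean skip equals rate.
        return self.rng.randint(1, 2 * self.rate - 1)

    def observe(self, packet):
        """Count one packet; return a sample record if it is selected, else None."""
        self.skip -= 1
        self.total_packets += 1
        if self.skip != 0:
            return None
        self.skip = self._next_skip()
        self.total_samples += 1
        return {
            "packet": packet,
            "total_samples": self.total_samples,
            "total_packets": self.total_packets,
        }
```

The only per-packet work is a decrement, an increment and a comparison, and the only state is three counters.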

sFlow Agent

[Figure: the sFlow agent combines two kinds of sample into each sFlow datagram: packet samples from the packet sampling process (including the forwarding decision associated with the sampled packet) and interface counter samples (time-based sampling, e.g. every 20s).]

sFlow exports packet headers

• Don’t expect layer 2 devices to decode the data• Much easier to add decodes to central collector than to every device in

a multi-vendor network (e.g. IPv6, FCoE etc.)• Packet header captures complex layering MAC, VLAN, MPLS, IPv4,

IPv6 that is critical for tracing packet paths through network

sFlow replaces counter polling

• sFlow agent automatically pushes full set of SNMP ifTable counters
• Compared to SNMP polling, counter push results in 10-20x fewer packets on the network and reduces CPU load on the switch and on network management software
  – XDR* is easier to encode/decode than ASN.1 used by SNMP
  – Counter push is not synchronised between devices
• A single sFlow collector can easily monitor 200,000 switch ports with 1 minute granularity. SNMP polling with 5 minute granularity requires 5-10 collectors.

*XDR (RFC 1832) is a standard for describing and encoding data transferred between systems with different architectures
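
To show why XDR is simple to encode, here is a short Python sketch of three XDR primitives: every item is big-endian and padded to a multiple of 4 bytes. The `encode_counters` record (interface index followed by input and output octet counts) is a hypothetical layout for illustration only, not the sFlow datagram format.

```python
import struct


def xdr_uint(value):
    """XDR unsigned integer: 4 bytes, big-endian."""
    return struct.pack(">I", value)


def xdr_uhyper(value):
    """XDR unsigned hyper integer: 8 bytes, big-endian."""
    return struct.pack(">Q", value)


def xdr_opaque(data):
    """XDR variable-length opaque data: length, bytes, zero padding to a 4-byte boundary."""
    padding = (-len(data)) % 4
    return xdr_uint(len(data)) + data + b"\x00" * padding


def encode_counters(if_index, in_octets, out_octets):
    # Hypothetical counter record: fixed layout, no tags or length prefixes needed.
    return xdr_uint(if_index) + xdr_uhyper(in_octets) + xdr_uhyper(out_octets)
```

Decoding the fixed-size record is a single call, `struct.unpack(">IQQ", record)`, with no tag or length parsing of the kind ASN.1 encodings require.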

Two types of measurement that are scalable with known accuracy

• Periodic sampling of counters
  – Counting is fast, hardware supports counting, most systems count events, transactions, errors etc.
• Statistical sampling of packets
  – A variant on packet counting: count down to zero, capture the packet, reset the counter with a new random number
• Why are these mechanisms scalable?
  1. They require minimal, fixed size state (just a block of counters per node). Total state space grows linearly with number of nodes.
  2. Very few operations required, easy to implement in hardware, very small impact when implemented in software
  3. Asynchronous, easily implemented without synchronization or locking mechanisms on multi-port, multi-module, multi-thread, multi-core devices etc.
• Accuracy
  1. Not 100% accurate but sufficiently accurate for many applications including billing
  2. Sampling accuracy determined by number of samples, not total population (http://blog.sflow.com/2009/05/scalability-and-accuracy-of-packet.html)
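
The point that accuracy depends on the number of samples can be made concrete with a small Python sketch. It uses the approximation commonly quoted in sFlow material for the 95% confidence error bound, percentage error <= 196 * sqrt(1/c), where c is the number of samples in the traffic class of interest; the function names are hypothetical and the formula is an approximation, not an exact result.

```python
import math


def estimated_total(samples, sampling_rate):
    """Scale a sample count up to an estimate of the total packet count."""
    return samples * sampling_rate


def percent_error_bound(samples):
    """Approximate 95% confidence bound on the relative error, in percent."""
    return 196.0 * math.sqrt(1.0 / samples)
```

For example, 10,000 samples give a bound of about 2% whether they were taken at 1 in 100 or 1 in 10,000, because the sampling rate and the total population do not appear in the formula.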

Real world example

• Outage on 24th September 2009 caused by high load on the Contacts Service
  – Network issue in the data centre
  – Unusually high load on the Contacts Service
  – Update to Gmail which also placed a high load on the Contacts Service
• Illustrates complex dependencies between networked components
• Monitoring traffic would have identified:
  – Network issue in the data centre
  – Abnormal connection rate to the Contacts Service
• Monitoring enables rapid identification of issues so that mitigating action can be taken promptly

Real world example: CERN

• Large Hadron Collider
  – High speed switched network used to collect measurements from the experiment and control the experiment
  – Sophisticated monitoring of the network is essential for successful operation of the experiments
  – CERN uses sFlow because of its scalability

"Because there are so many ports in the core switches, the SNMP query of interface counters takes a long time and occupies a lot CPU and memory resource."

Real world example: CERN Investigation of Network Behaviour and Anomaly Detection (CINBAD)

"Even in CERN 'academic' environment, we can not afford network downtimes, especially when LHC starts to produce peta bytes of data."

"CERN's campus network has more than 50,000 active user devices interconnected by 10,000 km of cables and fibres, with more than 2500 switches and routers. The potential 4.8 Tbps throughput within the network core and 140 Gbps connectivity to external networks offers countless possibilities to different network applications."

"To acquire knowledge about the network status and behaviour, CINBAD collects and analyses data from numerous sources. A naive approach might be to look at all of the packets flying over the CERN network. However, if we did this we would need to analyse even more data than the LHC could generate. The LHC data are only a subset of the total data crossing via these links."

"CINBAD overcomes this issue by applying statistical analysis and using sFlow, a technology for monitoring high-speed switched networks that provides randomly sampled packets from the network traffic."

Summary

• Network traffic monitoring and management manages the quality of service provided by the network
• Critical for the operation of modern networks
• Various technologies with different approaches to addressing the key design focus of scalability

References

• RMON
  – http://www.rfc-editor.org/rfc/rfc1757.txt
• NetFlow
  – http://www.cisco.com/en/US/tech/tk812/tsd_technology_support_protocol_home.html
• IPFIX
  – http://www.ietf.org/rfc/rfc5101.txt
  – http://datatracker.ietf.org/wg/ipfix/charter/
• sFlow
  – http://www.sflow.com
  – http://blog.sflow.com
• XDR
  – http://tools.ietf.org/rfc/rfc1832.txt
• CERN
  – http://cdsweb.cern.ch/record/1216160/files/LHCb-CONF-2009-047.pdf
  – http://cerncourier.com/cws/article/cnl/40379
