Network Monitoring and Data Center Operation
KDDI/APAN-JP/JGNⅡ
Self-Introduction
Jin Tanaka [email protected]
KDDI (Japanese telecommunication carrier), Otemachi Technical Center
Network Operator
- Worked as a network engineer at a commercial ISP for 2 years
- Currently working as a network operator for research, development & education networks
APAN-JP NOC (AS7660)
- http://www.apan.net : APAN
- http://www.jp.apan.net : APAN-JP page
- http://www.jp.apan.net/noc : NOC page
JGN2 international NOC
- http://www.jgn.nict.go.jp/ : JGN2
Agenda
1. Basic Knowledge of Network Monitoring
2. Network Monitoring Tools
3. Advanced Tools for Measuring Network Performance
4. Data Center Operation
5. Discussion & Question
Basic Knowledge of Network Monitoring
Why is Network Monitoring necessary?
Network reliability is considered more and more important
- Lifeline, business, and other mission-critical uses
Trouble is inevitable on any network
- With current IP technology, it is difficult to build a network that never has trouble!
To shorten downtime
- Detect trouble at an early stage
- Complete trouble-shooting quickly
To grasp the state of the network
- Availability, performance, routing
What is Monitoring 1
Basic way of Monitoring
Monitoring is classified into three ways:
1. Monitoring in the internal network (mostly)
2. Monitoring via an external network
- via a peering network
- via the Internet
3. Independent, non-network access (emergency case)
- ISDN, PSTN
[Figure: a monitoring machine reaching its targets through the internal network and the external network]
What is Monitoring 2
Scheme of Monitoring
1. Determine the monitoring target (What is the target for monitoring?)
2. Set up the monitoring node
- Ping to target, SNMP polling
3. Establish the threshold
- Ping / polling interval, SNMP MIB
4. Threshold exceeded
5. Notify the alert
- Sound, mail, pop-up
Monitoring is realized by repeating the above flow.
Trouble-shooting starts when the notification (5) is judged to be real trouble!
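The repetition of steps 2 to 5 can be sketched in a few lines of Python. This is only an illustration of the flow, not the logic of any particular monitoring product: the function name `monitor`, the sample values, and the threshold are assumptions made for the example.

```python
from typing import Callable, Iterable, List


def monitor(samples: Iterable[float], threshold: float,
            notify: Callable[[str], None]) -> int:
    """Compare each polled value with the threshold and notify on excess.

    Returns the number of alerts that were raised.
    """
    alerts = 0
    for cycle, value in enumerate(samples):   # step 2: one polling result per cycle
        if value > threshold:                 # step 4: threshold exceeded
            notify(f"poll {cycle}: value {value} exceeded threshold {threshold}")  # step 5
            alerts += 1
    return alerts


# Hypothetical RTT samples in msec; a real monitor would obtain them by ping or SNMP polling.
messages: List[str] = []
count = monitor([12.0, 15.0, 250.0, 14.0], threshold=100.0, notify=messages.append)
print(count, messages)
```

The `notify` callback stands in for sound, mail, or pop-up notification; whether an alert is real trouble is still left to the operator.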
Determination of Monitoring Target
Select targets that are suitable for checking that the network service is working normally. What is the target for monitoring?
Router
- Dead or alive? Status? Performance? Routing?
Server
- Dead or alive? Status? Daemon? Service port?
Traffic, etc.
- Increase or decrease? DoS attack? Performance? Environment?
Monitoring Method 1
Examine how to monitor the target
Active monitor or passive monitor
- Polling = the monitoring machine sends messages to the watched target
  - Useful for checking the current status
  - ICMP/SNMP polling...
- Receive trap messages from the target
  - Useful for detecting a status change
  - SNMP trap, syslog...
- Statistics data
  - Useful for grasping the trend and transition
Select the monitoring tool
- Ping (ICMP), SNMP, monitoring tool, original tool, etc.
Check the monitoring route to the target
- Internal or external network
Monitoring Method 2
Examine the frequency of monitoring
Monitor the target on a case-by-case basis or on a regular basis
- Is it necessary to monitor regularly using a monitoring tool or system?
  - Critical targets for providing the network service
  - Statistics data useful for trouble-shooting
Determine the monitoring interval
- 5/15/30 ... minutes, 1/8/24 ... hours, ...
Establish the threshold for alerts
- Required for generating an alert from a quantitative change
The best monitoring is done in an environment similar to the actual service conditions!
Notification of Alert
How to notify an alert
Selecting a suitable alert notification function is indispensable
Graphical
- Pop-up message, flashing icon on display
Mail
- Sent regularly, for checking the condition
- Sent only when there is a state change
Sound
An alert has no meaning if operators do not notice the network trouble!
Network Monitoring Tools
- ICMP/Ping Polling 1 -
Check IP reachability by ICMP echo / echo reply
Additional information
- RTT (Round Trip Time)
- Packet loss
- TTL (Time to Live)
The most standard way of checking whether a node is alive
Time-series RTT / packet loss data becomes important information when measuring link performance
[Figure: the monitoring machine sends an ICMP echo and the target returns an ICMP echo reply; the result reports RTT (xx msec), packet loss (xx %), and TTL (xx)]
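As a small illustration of how RTT and packet loss are derived from a series of echo requests, the sketch below summarizes a list of samples in which a lost packet is recorded as `None`. The function name `summarize_ping` and the sample values are assumptions for the example; it does not send any ICMP packets itself.

```python
from typing import Dict, Optional, Sequence


def summarize_ping(rtts_ms: Sequence[Optional[float]]) -> Dict[str, Optional[float]]:
    """Summarize RTT samples (msec); None means no echo reply was received."""
    sent = len(rtts_ms)
    replies = [rtt for rtt in rtts_ms if rtt is not None]
    loss = 100.0 * (sent - len(replies)) / sent if sent else 0.0
    if not replies:
        # Nothing came back, so RTT statistics are undefined.
        return {"loss_percent": loss, "min_ms": None, "avg_ms": None, "max_ms": None}
    return {
        "loss_percent": loss,
        "min_ms": min(replies),
        "avg_ms": sum(replies) / len(replies),
        "max_ms": max(replies),
    }


# Hypothetical samples: four echo requests, one of them lost.
summary = summarize_ping([10.0, None, 20.0, 30.0])
print(summary)
```

Storing such summaries per polling cycle gives the time-series data mentioned above.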
- ICMP/Ping Polling 2 -
Optional parameters
In daily operation
- Packet size (bytes)
- Sending interval (sec)
- Sending count (n)
- Timeout (sec)
- TTL (n)
- Pattern (0x????)
- etc.
In a monitoring system
- Sending interval
- Sending count
- Timeout
Set values that match the critical level or service level!
UDP/TCP polling
Effective for monitoring the service ports of a server
Using the client for the service
- DNS : nslookup
Using telnet
- WWW, SMTP, POP
Using a tool
- RADIUS : radping
[Figure: telnet to a service port and the reply from the server]
bash-2.05$ telnet ns.jp.apan.net 80
Trying 203.181.248.3...
Connected to ns.jp.apan.net.
Escape character is '^]'.
get
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>501 Method Not Implemented</title>
:
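The same "can I connect to the service port?" check that telnet performs by hand can be scripted. The sketch below uses only Python's standard `socket` module; the function name `tcp_port_open` and the default timeout are assumptions for the example, and it only tests that the TCP connection is accepted, not that the service answers correctly.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port is accepted within the timeout."""
    try:
        # create_connection performs name resolution and the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call (not executed here): tcp_port_open("ns.jp.apan.net", 80)
```

A monitoring system would call such a check at every polling interval and raise an alert after the configured number of failures.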
SNMP - Framework -
SNMP: Simple Network Management Protocol
- Polling : UDP 161, Trap : UDP 162
Protocol for monitoring/managing equipment via the network
Enables us to monitor the state and traffic of various equipment without depending on the vendor
Management is realized over UDP between...
- monitoring/managing server : Manager, e.g. HP OpenView, Sun NNM
- network equipment : Agent residing in the device, e.g. Unix daemon, Cisco IOS
The most general technique for acquiring detailed information from a router or a switch
SNMP - Version -
SNMP v1 (RFC 1157)
- When the Manager requests, the Agent returns a response
- The Agent sends a trap when a specific event has occurred
SNMP v2 (RFC 1902)
- Basic features are almost the same as those of v1
- Additional regulation
  - 64-bit counter : can deal with large numerical values
  - Get-bulk request : used to efficiently retrieve large blocks of data
  - Supports the use of encryption of messages
SNMP v3 (RFC 2271-2275)
- Additional regulation
  - Various security functions : MD5 user authentication, DES encryption
  - Dynamically configure the SNMP Agent using SNMP SET commands
SNMP - MIB & OID -
The SNMP Manager can acquire the management information defined by the MIB (Management Information Base) from the Agent
- Current version : MIB-II (RFC 1213)
- The MIB is the aggregate of objects (information) on the equipment which the SNMP Agent holds
- An identifier is defined for each object = OID
- MIBs implemented by an Agent are roughly divided into:
  - MIB-II : standard, public, specified by the IETF
  - Enterprise MIB : private, specified by the vendor company
SNMP - MIB Tree -
Objects are managed in a tree
Each object is expressed as a row of values separated by periods
root
  ccitt(0)
  iso(1)
    org(3)
      dod(6)
        internet(1)
          directory(1)
          mgmt(2)
            mib(1)          : Standard MIBs
          experimental(3)
          private(4)
            enterprise(1)   : Vendor-specific MIBs
  joint-iso-ccitt(2)
SNMP - OID -
OID Expression
iso(1).org(3).dod(6).internet(1).mgmt(2).mib-2(1) -> .1.3.6.1.2.1
e.g. sysDescr = .1.3.6.1.2.1.1.1 = mib-2.1.1 = system.1
Subtree name | OID | Description
system | 1.3.6.1.2.1.1 | Defines a list of objects that pertain to system operation, such as the system uptime, system contact, and system name.
interfaces | 1.3.6.1.2.1.2 | Keeps track of the status of each interface on a managed entity. The interfaces group monitors which interfaces are up or down and tracks such things as octets sent and received, errors and discards, etc.
at | 1.3.6.1.2.1.3 | The address translation (at) group is deprecated and is provided only for backward compatibility. It will probably be dropped from MIB-III.
ip | 1.3.6.1.2.1.4 | Keeps track of many aspects of IP, including IP routing.
icmp | 1.3.6.1.2.1.5 | Tracks things such as ICMP errors, discards, etc.
tcp | 1.3.6.1.2.1.6 | Tracks, among other things, the state of the TCP connection (e.g., closed, listen, synSent, etc.).
udp | 1.3.6.1.2.1.7 | Tracks UDP statistics, datagrams in and out, etc.
egp | 1.3.6.1.2.1.8 | Tracks various statistics about EGP and keeps an EGP neighbor table.
transmission | 1.3.6.1.2.1.10 | There are currently no objects defined for this group, but other media-specific MIBs are defined using this subtree.
snmp | 1.3.6.1.2.1.11 | Measures the performance of the underlying SNMP implementation on the managed entity and tracks things such as the number of SNMP packets sent and received.
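To make the dotted notation concrete, the sketch below maps an OID onto the MIB-II group it belongs to, using the subtree numbers from the table. The names `parse_oid` and `mib2_group` are assumptions for the example; the group numbers are those listed above.

```python
from typing import Optional, Tuple

MIB2_PREFIX = (1, 3, 6, 1, 2, 1)   # iso.org.dod.internet.mgmt.mib-2
MIB2_GROUPS = {
    1: "system", 2: "interfaces", 3: "at", 4: "ip", 5: "icmp",
    6: "tcp", 7: "udp", 8: "egp", 10: "transmission", 11: "snmp",
}


def parse_oid(text: str) -> Tuple[int, ...]:
    """Convert '.1.3.6.1.2.1' or '1.3.6.1.2.1' into a tuple of integers."""
    return tuple(int(part) for part in text.strip(".").split("."))


def mib2_group(oid_text: str) -> Optional[str]:
    """Return the MIB-II group name for an OID, or None if it lies outside MIB-II."""
    oid = parse_oid(oid_text)
    if len(oid) <= len(MIB2_PREFIX) or oid[:len(MIB2_PREFIX)] != MIB2_PREFIX:
        return None
    return MIB2_GROUPS.get(oid[len(MIB2_PREFIX)])


print(mib2_group(".1.3.6.1.2.1.1.1"))          # sysDescr lies under the system group
print(mib2_group("1.3.6.1.2.1.2.2.1.4.136"))   # an interfaces-group object
```

An OID under private(4).enterprise(1), such as a vendor MIB, returns `None` here because it is outside the standard subtree.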
SNMP - SNMP Message -
SNMP version : Identifies the version of SNMP (0 is for version 1)
Community : Password between Manager and Agent
PDU (Protocol Data Unit) : Actual command
Manager -> Agent
- GetRequest : Used to request the values of one or more MIB variables
- GetNextRequest : Used to read the values of variables in the MIB sequentially. It is often used to read through a table of values: after the first row is read with GetRequest, GetNextRequest is used to read through the remaining rows
- SetRequest : Used to update one of the MIB values
Agent -> Manager
- GetResponse : Returned as the answer to a GetRequest or GetNextRequest message
- Trap : Used to notify significant events (e.g. a cold or a warm restart...)
SNMP - SNMP Message Handling 1 -
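SNMP Manager <-> SNMP Agent
- Manager -> Agent : GetRequest (What is the value of MIB?)
- Agent -> Manager : GetResponse (The value is XXXX!)
- Manager -> Agent : GetNextRequest (What is the next value of MIB Tree?)
- Agent -> Manager : GetResponse (The value is XXXX!)
- Manager -> Agent : SetRequest (Modify the value of OID)
- Agent -> Manager : GetResponse (The value is XXXX!)
- Agent -> Manager : Trap (Problem happened!)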
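The request/response exchange can be imitated without any network by a toy agent that keeps its MIB in a dictionary. This is a teaching sketch only: `ToyAgent`, `walk`, and the stored values are assumptions for the example, and no real SNMP encoding or library is involved. It shows why GetNextRequest is enough to read through a whole subtree.

```python
from typing import Dict, List, Optional, Tuple

Oid = Tuple[int, ...]


def to_oid(text: str) -> Oid:
    """Convert '.1.3.6.1' or '1.3.6.1' into a tuple of integers."""
    return tuple(int(part) for part in text.strip(".").split("."))


def to_text(oid: Oid) -> str:
    return "." + ".".join(str(n) for n in oid)


class ToyAgent:
    """In-memory stand-in for an SNMP agent (no network, no real PDUs)."""

    def __init__(self, values: Dict[str, object]) -> None:
        self._mib: Dict[Oid, object] = {to_oid(k): v for k, v in values.items()}

    def get(self, oid: str) -> Optional[object]:
        # GetRequest: only an exact object instance can be answered.
        return self._mib.get(to_oid(oid))

    def get_next(self, oid: str) -> Optional[Tuple[str, object]]:
        # GetNextRequest: the first object after the given OID in tree order.
        start = to_oid(oid)
        later = sorted(k for k in self._mib if k > start)
        if not later:
            return None
        return to_text(later[0]), self._mib[later[0]]

    def set(self, oid: str, value: object) -> bool:
        # SetRequest: update an existing object.
        key = to_oid(oid)
        if key not in self._mib:
            return False
        self._mib[key] = value
        return True


def walk(agent: ToyAgent, root: str) -> List[Tuple[str, object]]:
    """Repeat GetNextRequest until the answer leaves the subtree under root."""
    prefix = to_oid(root)
    rows: List[Tuple[str, object]] = []
    current = root
    while True:
        answer = agent.get_next(current)
        if answer is None or to_oid(answer[0])[:len(prefix)] != prefix:
            return rows
        rows.append(answer)
        current = answer[0]


agent = ToyAgent({
    ".1.3.6.1.2.1.1.1.0": "toy router",       # sysDescr.0
    ".1.3.6.1.2.1.1.6.0": "",                 # sysLocation.0
    ".1.3.6.1.2.1.2.2.1.4.136": 9192,         # ifMtu.136
})
print(agent.get(".1.3.6.1.2.1.1"))            # a subtree is not an instance
print(walk(agent, ".1.3.6.1.2.1.1"))          # both system objects
```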
SNMP - SNMP Message Handling 2 -
Command examples
GetRequest
inetapan@tools:~> snmpget -v2c -c xxxx tpr2.jp.apan.net .1.3.6.1.2.1.2.2.1.4.136
IF-MIB::ifMtu.136 = INTEGER: 9192
GetNextRequest
inetapan@tools:~> snmpget -v2c -c xxxx tpr2.jp.apan.net system
SNMPv2-MIB::system = No Such Object available on this agent at this OID
inetapan@tools:~> snmpwalk -v2c -c xxxx tpr2.jp.apan.net system
SNMPv2-MIB::sysDescr.0 = STRING: m20 internet router, kernel 6.2R3.10
SNMPv2-MIB::sysObjectID.0 = OID: SNMPv2-SMI::enterprises.2636.1.1.1.2.2
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (423280751) 48 days, 23:46:47.51
SNMPv2-MIB::sysContact.0 = STRING:
SNMPv2-MIB::sysName.0 = STRING: tpr2
SNMPv2-MIB::sysLocation.0 = STRING:
SNMPv2-MIB::sysServices.0 = INTEGER: 4
SetRequest
inetapan@tools:~> snmpget -v2c -c xxxx tppr.jp.apan.net system.sysLocation.0
system.sysLocation.0 = ""
inetapan@tools:~> snmpset -v2c -c yyyy tppr.jp.apan.net system.sysLocation.0 s "Tokyo, JP"
system.sysLocation.0 = "Tokyo, JP"
inetapan@tools:~> snmpget -v2c -c xxxx tppr.jp.apan.net system.sysLocation.0
system.sysLocation.0 = "Tokyo, JP"
SNMP - Trap Message -
The way for an Agent to inform the Manager that something undesirable has happened
A trap originates from the Agent and is sent to the trap destination configured within the Agent itself
When the Manager receives a trap, it needs to know how to interpret it
PDU
- Enterprise : Vendor identification (OID) for the agent
- AgentAddress : The IP address of the node where the trap was generated
- Trap Type : Generic / Specific (not used)
- Timestamp : The length of time between the last re-initialization of the agent that issued the trap and the moment at which the trap was issued
Monitoring Software - HP OpenView -
HP OpenView Network Node Manager ®
http://www.openview.hp.com/products/nnm/index.html
Overview
- Auto discovery and mapping
- Drill-down views (hierarchy map)
- Fault monitoring : ICMP / SNMP polling
- Event monitoring : trap receiving / event configuration
- SNMP tools : status polling, MIB browser
- Web-based reports
- A rich set of extension software
- Platform : Windows 2000/XP, Solaris 8/9, HP-UX
APAN-JP NOC mainly uses OpenView to monitor its network!
Monitoring Software - HP OpenView Sample 1-
OpenView Contracture
[Screenshots: event log, ICMP polling for connectivity check, network map, router map, network sub-map]
Monitoring Software - HP OpenView Sample 2-
OpenView Tools
[Screenshots: SNMP configuration for polling (parameters, community), event configuration, data collection & thresholds for SNMP]
Monitoring Software - Nagios Overview-
Nagios ®
Freely available from http://www.nagios.org
Overview
- A host and service monitor designed to inform you of network and end-system problems
- Provides simple ping availability checks of resources on the network
- Works with a set of "plugins" to provide local and remote host service status
- Custom "plugins" are relatively easy to develop
- Web-based monitoring system
- Platform : Linux, UNIX
APAN-JP NOC uses Nagios as its secondary monitoring system!
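Part of why custom plugins are easy to write is that a plugin only has to print one status line and return an exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN. The sketch below shows that decision logic for an RTT check; the function name `check_rtt` and the thresholds are assumptions for the example, and the measurement itself is left out.

```python
from typing import Optional, Tuple

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_rtt(rtt_ms: Optional[float], warn_ms: float, crit_ms: float) -> Tuple[int, str]:
    """Map a measured RTT onto a plugin exit code and a one-line status message."""
    if rtt_ms is None:
        return UNKNOWN, "RTT UNKNOWN - no measurement available"
    if rtt_ms >= crit_ms:
        return CRITICAL, f"RTT CRITICAL - {rtt_ms:.1f} ms"
    if rtt_ms >= warn_ms:
        return WARNING, f"RTT WARNING - {rtt_ms:.1f} ms"
    return OK, f"RTT OK - {rtt_ms:.1f} ms"


# Hypothetical measurement of 190 ms against thresholds of 150 ms / 300 ms.
code, message = check_rtt(190.0, warn_ms=150.0, crit_ms=300.0)
print(code, message)
# A real plugin would finish with: print(message); sys.exit(code)
```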
Monitoring Software - Nagios Sample 1-
[Screenshots: Nagios "Service Overview For All Host Groups" and "Service Status Details For All Hosts"]
Monitoring Software - Nagios Sample 2-
[Screenshots: Nagios "Network Map For All Hosts" and event log]
MRTG (Multi-Router Traffic Grapher)
Overview
- Monitors the load of network equipment using SNMP; mainly used for creating traffic graphs
- Excellent graphing tool developed by Tobias Oetiker
- Plots a graph of any two variables against time; the graph is rendered as a PNG image on an HTML page
- Scripts can be written to feed data into MRTG
- Implements data collection, image generation, and web-page generation
- Very widely deployed in large networks and still being actively developed
- Platform : UNIX systems / Windows NT
- Supports SNMPv2 : able to read 64-bit counters
- http://people.ee.ethz.ch/~oetiker/webtools/mrtg/
MRTG - Workflow -
Display of graph
- Green area typically represents incoming maximum bits per second
- Blue line typically represents outgoing maximum bits per second
Workflow
1. Read configuration file
2. Collect graphing data from network equipment, based on configuration
3. Update database file and generate graph
4. If required, generate HTML file
MRTG performs the above workflow once and then exits
Since MRTG collects data for the past 5 minutes (default value in the source code), it is desirable to set "crontab" to run it every 5 minutes
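Step 2 of the workflow yields interface octet counters, and the traffic rate comes from the difference between two consecutive readings. The sketch below shows that calculation, including the case where the counter wrapped once between polls. It is an illustration under stated assumptions, not MRTG's own code: the function name `counter_rate_bps` and the sample readings are hypothetical.

```python
def counter_rate_bps(prev_octets: int, curr_octets: int,
                     interval_sec: float, counter_bits: int = 32) -> float:
    """Average bit rate between two readings of an octet counter.

    Assumes the counter wrapped at most once during the interval.
    """
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    delta = curr_octets - prev_octets
    if delta < 0:
        # The counter passed its maximum and restarted from zero.
        delta += 2 ** counter_bits
    return delta * 8 / interval_sec


# Hypothetical readings taken 300 seconds (5 minutes) apart.
print(counter_rate_bps(1000, 38500, 300))            # 37500 octets in 300 s
print(counter_rate_bps(2 ** 32 - 100, 200, 300))     # reading taken across a wrap
```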
MRTG - Data Storage -
Graph | Interval | Num of records | Storage period
Daily | 5 minutes | 600 | about 2 days (50 hours)
Weekly | 30 minutes | 600 | 12.5 days
Monthly | 2 hours | 600 | 50 days
Yearly | 1 day | 731 | 2 years
(Resolution becomes rougher from Daily to Yearly)
Data Storage
- Keeps 5-minute data only for about 2 days (600 records x 5 minutes = 50 hours). The data is thrown away afterward.
  - Historical data cannot be referred to at high resolution
- Keeps 1-day data for approx. 2 years
MRTG - Configuration 1 -
MRTG Configuration
cfgmaker
- Helps to create the configuration file
Example
cfgmaker --global 'WorkDir: /home/httpd/html/mrtg' \
  --global 'Options[_]: bits,growright' \
  --output /home/httpd/html/mrtg/cfg/mrtg.cfg \
  <community>@<address>
- Graph & log data : /home/httpd/html/mrtg
- Configuration file : /home/httpd/html/mrtg/cfg/mrtg.cfg
- Options : unit = bits (bps), horizontal axis = grows to the right
Detailed information
http://people.ee.ethz.ch/~oetiker/webtools/mrtg/cfgmaker.html
MRTG - Configuration 2 -
Target Configuration
Target Expression
Target[<target name>]: <target kind>:<community>@<address>
- <target name> : Identifies the equipment
- <target kind> : Measurement item
- <community> : SNMP community string
- <address> : Hostname or IP address of the equipment
SNMP data collection specification methods
- Basic / Port (ifIndex)
  Target[myrouter]: 2:[email protected]
- Explicit OIDs / MIB variables
  Target[myrouter]: 1.3.6.1.2.1.2.2.1.14.1&1.3.6.1.2.1.2.2.1.20.1:public@myrouter
  Target[myrouter]: ifInErrors.1&ifOutErrors.1:public@myrouter
You can use cfgmaker to generate interface references with the option --ifref=?
- ifref=ip : Interface by IP address
- ifref=descr : Interface by description
- ifref=name : Interface by name
- ifref=eth : Interface by Ethernet address
MRTG - Configuration 3 -
Example of Configuration
Target[la]: ifHCInOctets\so-2/0/0&ifHCOutOctets\so-2/0/0:[email protected]:::::2
MaxBytes[la]: 300000000
Title[la]: Traffic Analysis of TransPAC LA Link
PageTop[la]: <H1>Traffic Analysis of TransPAC LA link</H1>
WithPeak[la]: ymw
Directory[la]: tpr2
Options[la]: bits, growright

Target[la-err]: ifInErrors\so-2/0/0&ifOutErrors\so-2/0/0:[email protected]
MaxBytes[la-err]: 300000000
Title[la-err]: Packet Error for TransPAC LA link
PageTop[la-err]: <H1>Packet Error for TransPAC LA link</H1>
Directory[la-err]: tpr2
Options[la-err]: growright, integer, nopercent
YLegend[la-err]: Number of Error Packets
ShortLegend[la-err]: n
Legend1[la-err]: Number of Error Packets for Incoming Traffic
Legend2[la-err]: Number of Error Packets for Outgoing Traffic
Legend3[la-err]: Peak of Number of Error Packets for Incoming Traffic
Legend4[la-err]: Peak of Number of Error Packets for Outgoing Traffic
LegendI[la-err]: In:
LegendO[la-err]: Out:
WithPeak[la-err]: w
MRTG - Comments -
Comments / Disadvantages
- If you are to monitor a lot of devices (1000s), it is better to have a fast disk
- If using external monitoring scripts, a fast processor and a lot of memory are necessary
- Not particularly fast when compared to other data retrieval and storage schemes (flat text files can slow down processing)
- MRTG can't customize graphing periods
- Flat text files are difficult to process when scripting against the data
- Use 64-bit counters with SNMPv2 for OC3-OC192 and GbE interfaces: traffic of about 115 Mbps can wrap a 32-bit counter around within 5 minutes
- MRTG can't modify collected data once it has been summarized
- Only two variables are available when drawing a graph
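The 115 Mbps figure follows from simple arithmetic: a 32-bit octet counter holds 2^32 bytes, so it wraps within one 5-minute polling interval once the sustained rate reaches 2^32 x 8 / 300 bits per second. The sketch below works this out; the function names are chosen for the example only.

```python
def min_wrap_rate_bps(poll_interval_sec: float, counter_bits: int = 32) -> float:
    """Lowest sustained bit rate at which an octet counter wraps within one polling interval."""
    return (2 ** counter_bits) * 8 / poll_interval_sec


def wrap_time_sec(rate_bps: float, counter_bits: int = 32) -> float:
    """Time an octet counter takes to wrap at a given sustained bit rate."""
    return (2 ** counter_bits) * 8 / rate_bps


print(round(min_wrap_rate_bps(300) / 1e6, 1))   # Mbps needed to wrap in 5 minutes
print(round(wrap_time_sec(1e9), 1))             # seconds to wrap at a full 1 Gbps
```

At full Gigabit Ethernet speed a 32-bit counter wraps in well under a minute, which is why the 64-bit ifHC counters are needed for such interfaces.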
RRDtool (Round Robin Database Tool)
Overview
- Successor to MRTG
- Developed by the same developer as MRTG : Tobias Oetiker
- A group of tools for RRDs that can flexibly define data items, time intervals, data amounts, graph depiction, etc.
- Binary file format that can store data at any interval for any length of time
  - File does not grow in size over time
- Ability to make custom graphs across user-defined intervals
  - Ability to graph multiple variables on a single graph
- Additional scripts are necessary for creating graphs and web pages
- 25-30 percent faster than MRTG
- Does not have a function to collect data
- http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/
RRDtool - Architecture -
Comparison of architecture between MRTG and RRD
[Figure: architecture comparison; labels: router, server, SNMP engine, frontend program, RRD, text log, graph, index]
RRDtool - Basic Usage -
Basic usage of RRD tools
① Set up a new Round Robin Database (RRD)
- Define the RRD used as the vessel for data
- Command : rrdtool create filename
② Store a new set of values into the RRD periodically
- Write the data collected by the frontend program into the RRD
- Command : rrdtool update filename
③ Generate a graph
- Create a graph from data stored in one or several RRDs
- Command : rrdtool graph filename (specify the name of the graph to generate)
[Figure: ① the RRD is created, ② data is stored into it repeatedly, ③ a graph is generated from it]
RRDtool - Practice -
Example
Object
- Gigabit Ethernet switch
Definition
- Definition of RRD records
Graph | Interval | Num of RRD records | Storage period
4 hours | 1 minute | 360 | 6 hours
Daily | 5 minutes | 576 | 2 days
Monthly | 2 hours | 600 | 50 days
Yearly | 1 day | 731 | 2 years
10 years | 4 days | 915 | 10 years
Ability to draw peak graphs from data spanning 1 day to 10 years
RRDtool - Create -
Set up a new Round Robin Database (RRD)
DS : Define the data item
- COUNTER : continuously increasing counter
- 60 : if no new data is supplied for more than 60 sec, the value is considered "unknown"
- 0 : minimum acceptable value (bytes)
- 125000000 : maximum acceptable value (bytes)
RRA (Round Robin Archive) : Define the data consolidation
- AVERAGE / MAX : average / maximum of the consolidated data
- 0.5 : the fraction of a consolidation interval that may be made up of *UNKNOWN* data while the consolidated value is still regarded as known
  - AVERAGE 50%, MAX 20% or 10%
- 1 : how many primary data points are used to build a consolidated data point, which then goes into the archive
- 360 : how many generations of data values are kept in the RRA
Command Example
/usr/local/rrdtool-1.0.46/bin/rrdtool create \
  /home/httpd/html/traffic/traffic_vlan.rrd \
  --step 60 \
  DS:vlan2in:COUNTER:60:0:125000000 \
  DS:vlan2out:COUNTER:60:0:125000000 \
  DS:vlan7in:COUNTER:60:0:125000000 \
  DS:vlan7out:COUNTER:60:0:125000000 \
  :
  RRA:AVERAGE:0.5:1:360 \
  RRA:AVERAGE:0.5:5:576 \
  RRA:AVERAGE:0.5:120:600 \
  RRA:AVERAGE:0.5:1440:731 \
  RRA:AVERAGE:0.5:5760:915 \
  RRA:MAX:0.2:5:576 \
  RRA:MAX:0.1:120:600 \
  RRA:MAX:0.1:1440:731 \
  RRA:MAX:0.1:5760:915
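The last two numbers of each RRA line determine what the archive holds: with a 60-second step, the interval of one stored row is step x steps-per-row, and the storage period is that interval x the number of rows. The sketch below checks this for the step and RRA sizes used in this example; the function name `rra_coverage` is an assumption for illustration.

```python
from typing import Tuple


def rra_coverage(step_sec: int, steps_per_row: int, rows: int) -> Tuple[int, int]:
    """Return (interval of one row, total storage period), both in seconds."""
    interval = step_sec * steps_per_row
    return interval, interval * rows


DAY = 86400
for steps, rows in [(1, 360), (5, 576), (120, 600), (1440, 731), (5760, 915)]:
    interval, period = rra_coverage(60, steps, rows)
    print(f"{steps:>5} steps x {rows} rows: interval {interval} s, period {period / DAY:.2f} days")
```

The results (6 hours, 2 days, 50 days, 731 days, 3660 days) match the storage periods in the definition table.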
RRDtool - Update -
Stores a new set of values into the RRD periodically
Data collection
- Collect the data from targets using a frontend program
  - Original tool
  - Cricket - http://cricket.sourceforge.net/
  - Orca - http://www.orcaware.com/orca/
  - SNAPP - http://sourceforge.net/projects/snapp/
Updating an RRD
- Feed the collected data into an RRD using the following command
Command Example
rrdtool update /home/httpd/html/traffic/traffic_vlan.rrd \
  --template in:out N:11222:1
- /home/httpd/html/traffic/traffic_vlan.rrd : the name of the RRD you want to update
- in:out (DS1:DS2) : the data sources defined in the RRD
- N : the update time is set to the current time
RRDtool - Graph 1 -
Generating Graph -1-
Command Example
rrdtool graph /home/httpd/html/traffic/traffic.png -s -4h -w 800 -h 800 -a PNG \
  -t "VLAN Traffic" -v "bit/s" \
  DEF:vlan2in_ave=/home/httpd/html/traffic/traffic_vlan.rrd:vlan2in:AVERAGE \
  DEF:vlan2out_ave=/home/httpd/html/traffic/traffic_vlan.rrd:vlan2out:AVERAGE \
  DEF:vlan7in_ave=/home/httpd/html/traffic/traffic_vlan.rrd:vlan7in:AVERAGE \
  DEF:vlan7out_ave=/home/httpd/html/traffic/traffic_vlan.rrd:vlan7out:AVERAGE \
  CDEF:vlan2in_ave_bit=vlan2in_ave,8,* \
  CDEF:vlan7in_ave_bit=vlan7in_ave,8,* \
  CDEF:vlan2out_ave_bit=vlan2out_ave,-8,* \
  CDEF:vlan7out_ave_bit=vlan7out_ave,-8,* \
  AREA:vlan2in_ave_bit#ff5e5e:VLAN2-in \
  STACK:vlan7in_ave_bit#5eff5e:VLAN7-in \
  AREA:vlan2out_ave_bit#aa0101:VLAN2-out \
  STACK:vlan7out_ave_bit#0101aa:VLAN7-out
Options
- -s : start time (default unit : seconds)
- -e : end time (default unit : seconds)
- -w, -h : width and height in pixels
- -a : image format (GIF|PNG)
- -t : graph title
- -v : vertical label text
RRDtool - Graph 2 -
Generating a Graph -2-
DEF
- Defines a virtual name for a data source
- DEF:<vname>=<RRD filename>:<DS-name>:<CF>
- CF : consolidation function; select AVERAGE, MAX, MIN, or LAST (newest data)
CDEF
- Creates a new virtual data source by evaluating a mathematical expression
- CDEF:<vname>=<rpn-expression> (Reverse Polish Notation)
Graph depiction parameter
- <Style>:<vname>#<color>:<legend>
- LINE : Plots the requested data as a line, using the color specified
- AREA : The area between the 0 line and the graph line is filled with the color specified
- STACK : The graph gets stacked on top of the previous LINE, AREA, or STACK graph
By regenerating the graph periodically using "crontab", you can see updated graphs on the Web
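For readers unfamiliar with Reverse Polish Notation, the sketch below evaluates a comma-separated expression such as `vlan2in_ave,8,*` (bytes per second multiplied by 8 to obtain bits per second). It supports only the four arithmetic operators and is not RRDtool's own evaluator; the function name `eval_rpn` and the sample values are assumptions for the example.

```python
import operator
from typing import Dict, List

OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_rpn(expression: str, variables: Dict[str, float]) -> float:
    """Evaluate a comma-separated RPN expression with named variables."""
    stack: List[float] = []
    for token in (t.strip() for t in expression.split(",")):
        if token in OPERATORS:
            right = stack.pop()          # operands come off in reverse order
            left = stack.pop()
            stack.append(OPERATORS[token](left, right))
        elif token in variables:
            stack.append(float(variables[token]))
        else:
            stack.append(float(token))   # a numeric literal such as 8 or -8
    if len(stack) != 1:
        raise ValueError("malformed RPN expression")
    return stack[0]


# Hypothetical averages in bytes per second.
print(eval_rpn("vlan2in_ave,8,*", {"vlan2in_ave": 1000.0}))
print(eval_rpn("vlan2out_ave,-8,*", {"vlan2out_ave": 500.0}))
```

Multiplying the outgoing values by -8 makes them negative, so outgoing traffic is drawn below the zero line and incoming traffic above it.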
RRDtool - Sample -
Sample Graph
http://mrtg.jp.apan.net/cricket/router-interfaces/
Advanced Tools for Measuring Network Performance
Iperf - Overview -
Iperf is used to measure TCP and UDP bandwidth performance
- Tool to measure maximum TCP bandwidth, allowing the tuning of various parameters, and UDP characteristics
- Able to do "memory to memory" transfers so that disk I/O does not influence the results
- Client and server can have multiple simultaneous connections
- Supports IPv6
- Platform : UNIX systems / Windows / Mac OS
- Effective for investigating circuit quality when a new circuit is established
- http://dast.nlanr.net/Projects/Iperf/
Iperf - Mode -
[Figure: Iperf client sending test traffic to an Iperf server]
TCP mode
- Measures bandwidth
- Reports MSS (Maximum Segment Size) / MTU (Maximum Transmission Unit) size and observed read sizes
- Supports TCP window size via socket buffers
UDP mode
- Client can create UDP streams of a specified bandwidth
- Measures packet loss, delay, jitter
Since real traffic is generated, Iperf must be operated with care!
Iperf - Example -
Test result Example
test% iperf -u -i1 -s
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size: 1.00 MByte (default)
------------------------------------------------------------
[ 3] local 203.181.249.xxx port 5xxx connected with 203.181.248.xx port 32781
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 4] 0.0- 1.0 sec 61.0 MBytes 511 Mbits/sec 0.006 ms 0/43492 (0%)
[ 4] 1.0- 2.0 sec 61.0 MBytes 511 Mbits/sec 0.005 ms 0/43479 (0%)
[ 4] 2.0- 3.0 sec 61.0 MBytes 511 Mbits/sec 0.005 ms 1/43478 (0.0023%)
[ 4] 3.0- 4.0 sec 61.0 MBytes 511 Mbits/sec 0.007 ms -1/43478 (-0.0023%)
[ 4] 4.0- 5.0 sec 61.0 MBytes 511 Mbits/sec 0.004 ms 0/43478 (0%)
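The numbers in one report line can be cross-checked by hand: the bandwidth is the number of datagrams times the 1470-byte datagram size times 8, divided by the interval, and the loss rate is lost / total. The sketch below repeats that arithmetic for the third interval of the report; the function names are chosen for the example.

```python
def udp_bandwidth_mbps(datagrams: int, datagram_bytes: int, interval_sec: float) -> float:
    """Bandwidth in Mbit/s carried by a number of datagrams in one interval."""
    return datagrams * datagram_bytes * 8 / interval_sec / 1e6


def loss_percent(lost: int, total: int) -> float:
    """Share of lost datagrams in percent."""
    return 100.0 * lost / total


# Values taken from the interval 2.0-3.0 sec of the report: 1 lost out of 43478.
print(round(udp_bandwidth_mbps(43478, 1470, 1.0)))   # Mbit/s
print(round(loss_percent(1, 43478), 4))              # percent
```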
BWCTL (Bandwidth Control)
- BWCTL is a resource allocation and scheduling daemon for arbitration of Iperf tests
- The BWCTL client application works by contacting a bwctld process on both endpoints of the test
- Requires NTP to be running to synchronize the system clocks
- Open mode : everyone can use it
- Authentication mode : need to exchange an AES key
- Supports IPv6
- Platform : UNIX systems
- Developed by Internet2 http://e2epi.internet2.edu/bwctl/
Without BWCTL, users running bandwidth tests could not be certain whether their test was scheduled in a time frame where no other tests were running
OWAMP (One-way Active Measurement Protocol)
- OWAMP is a command-line client application and a policy daemon used to determine one-way latencies between hosts
- It makes it possible to collect active measurement data
  - e.g., one-way delay, packet loss, jitter
- NTP must be set up correctly on the system to calculate a reasonable estimate of the time error and to stabilize the clock
- Supports IPv6
- Platform : UNIX systems
- Current draft : draft-ietf-ippm-owdp-10.txt
- Developed by Internet2 http://e2epi.internet2.edu/owamp/
Round-trip-based measurement cannot identify the delay in each direction, especially when asymmetric routes are used
- ICMP ping : RTT
- owping : one-way
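A short calculation shows why one-way measurement needs synchronized clocks and why RTT alone hides asymmetry. The sketch assumes hypothetical timestamps in seconds and a known clock offset between sender and receiver (which NTP is meant to keep small); the function name `one_way_delay_ms` and all values are made up for illustration.

```python
def one_way_delay_ms(send_ts: float, recv_ts: float, clock_offset: float = 0.0) -> float:
    """One-way delay in msec.

    send_ts is read from the sender's clock, recv_ts from the receiver's clock;
    clock_offset is how far the receiver's clock runs ahead of the sender's.
    """
    return (recv_ts - send_ts - clock_offset) * 1000.0


# Hypothetical asymmetric path: 120 ms in one direction, 70 ms in the other.
forward = one_way_delay_ms(10.000, 10.120)
reverse = one_way_delay_ms(10.120, 10.190)
rtt = forward + reverse
print(forward, reverse, rtt, rtt / 2)
```

Here ping would report an RTT of about 190 ms, and halving it gives 95 ms, which matches neither direction. An uncorrected clock offset of 20 ms would shift both one-way results by 20 ms in opposite directions, which is why NTP is required.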
OWAMP - Protocol -
Consists of two inter-related protocols
- OWAMP-Control : Used to initiate, start/stop test sessions, and fetch test results
- OWAMP-Test : Defines the format of the probe packets
Sample measurement data
- http://pe2.koganei.wide.ad.jp/cgi-bin/owd-stat
- http://qpe.jp.apan.net/cgi-bin/owd-stat
Netflow - Overview -
Overview
- Enables IP traffic flow analysis without probes
- Invented and patented by Cisco
  - Juniper (called cflowd), Foundry, ... : many vendors support it
- Flow cache data on routers is exported to a flow tool so that traffic flows can be analyzed
Flow definition:
- Source IP address
- Destination IP address
- Source port
- Destination port
- Layer 3 protocol type
- TOS byte (DSCP)
- Input logical interface (ifIndex)
[Figure: traffic crosses the core network with NetFlow enabled; routers send UDP NetFlow export packets to a collector (Solaris, HP-UX, or Linux), which feeds an application GUI]
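The flow definition is a seven-field key: packets that agree on all seven fields are counted as one flow. The sketch below groups packets by that key and accumulates packet and byte counts, which is conceptually what a router's flow cache does. `FlowKey`, `aggregate_flows`, and the sample packets (using documentation address ranges) are assumptions for the example, not a NetFlow export format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class FlowKey(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int        # layer 3 protocol type, e.g. 6 = TCP, 17 = UDP
    tos: int             # TOS byte (DSCP)
    input_ifindex: int   # input logical interface


def aggregate_flows(packets: Iterable[Tuple[FlowKey, int]]) -> Dict[FlowKey, Tuple[int, int]]:
    """Group (key, packet size) pairs into flows of (packet count, byte count)."""
    totals: Dict[FlowKey, List[int]] = defaultdict(lambda: [0, 0])
    for key, size in packets:
        totals[key][0] += 1
        totals[key][1] += size
    return {key: (pkts, octets) for key, (pkts, octets) in totals.items()}


web = FlowKey("192.0.2.10", "198.51.100.5", 40000, 80, 6, 0, 136)
dns = FlowKey("192.0.2.10", "198.51.100.53", 40001, 53, 17, 0, 136)
flows = aggregate_flows([(web, 1500), (dns, 80), (web, 1500), (web, 400)])
print(flows[web], flows[dns])
```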
Netflow - Flow Data -
Flow data export
- Enable NetFlow on the router
- There is a difference in architecture between Cisco and Juniper routers
- Take care that the load on the router does not become high!
  - Check CPU, memory, bandwidth, sampling rate
Flow data collection & analysis
- Prepare the software for receiving flow-export data
  - flow-tools http://www.splintered.net/sw/flow-tools/
  - cflowd http://www.caida.org/tools/measurement/cflowd/
  - Cisco : NetFlow Collector
- Analyze traffic from the raw data with software
  - FlowScan http://net.doit.wisc.edu/~plonka/FlowScan/ (if you want to graph the analysis data, I recommend using RRDtool)
  - Cisco : CiscoWorks
Information contained in flow data
- Source and destination IP address
- Source and destination TCP/UDP ports
- Packet and byte counts
- Routing information (next-hop address, source autonomous system (AS) number, destination AS number, source prefix mask, destination prefix mask)
Netflow - Example -
[Figure: NetFlow example]
Observatory - Overview -
Observatory Project
- Abilene Observatory http://abilene.internet2.edu/observatory/
  - Abilene backbone http://abilene.internet2.edu/
- APAN Observatory http://www.jp.apan.net/NOC/Observatory/
A system which collects network performance data on the backbone
Collected data can be used for operation and research
APAN is preparing three types of data, collected across the JP-US link and shared publicly
- Latency data : using OWAMP
- NetFlow data : using NetFlow (Juniper, Procket & flow-tools)
- Throughput data : using Iperf (BWCTL)
[Figure: JP-US links between APAN Tokyo XP, Los Angeles, and Chicago/Indianapolis; average RTT 190 ms]
Observatory – Formation -
The Observatory system will help R&D networks grow!
Observatory
- Developed by NOC researchers & maintained by NOC engineers/operators
- Common tools with high priority
- Output data in standard format
[Figure: the Observatory exchanges data (with authentication), software, papers, etc. with researchers; it is positioned as the NOC Advanced Service on top of the NOC Basic Service and is helpful in operation]
Researchers can get measurement data of the global networks and collaborate with foreign researchers.
Introduction of other advanced tools
Abilene Router Proxy - Overview -
- Similar to Looking Glass, but with some advanced functions
- Web form allows users to submit various commands to backbone routers
- Allows remote network operators to troubleshoot problems without contacting the NOC
- Unix-based
- Uses scripted telnet to log in to the routers and grab the output
- Not designed for high-speed access to backbone information
- Very useful operation tool for inter-domain networks
- Enables us to view the operational situation of almost all Abilene routers
http://ratt.uits.iu.edu/routerproxy/abilene/
Introduction of other advanced tools
Abilene Router Proxy - Sample -
Summary
The table below ranks each tool according to four core criteria (1 : lowest - 5 : highest)
Tool | Accessible | Useful for operation | Useful for trouble-shooting | Low cost
OpenView | 3 | 4 | 5 | 2
Nagios | 4 | 2 | 4 | 5
MRTG | 4 | 5 | 4 | 5
RRD | 3 | 4 | 5 | 5
Iperf (BWCTL) | 3 | 4 | 5 | 5
OWAMP | 2 | 3 | 5 | 5
NetFlow | 3 | 4 | 5 | 4
Router Proxy | 4 | 5 | 5 | 5
Data Center Operation
Data Center operation - Service -
1. Circuit service
- Leased circuit, ATM/SONET/Ethernet/VPN
2. Housing / co-location service
- Rack co-location, open co-location
- Security consideration
  - Security camera, security entry system
3. Site management service
- Basic service
  - Check entering/leaving, check the power supply & air conditioning, check equipment lamps, power equipment off/on
- Monitoring service
  - Ping monitoring, service port monitoring, log monitoring, etc.
- Report service
  - Traffic graph report (MRTG), resource report, etc.
- Alternate processing of routine work
  - Tape change, scheduled equipment reboot, etc.
- Assistance in trouble-shooting
  - Technical support over telephone, locating trouble points, etc.
Data Center operation - Service 2 -
4. Professional management & operation
- Outsourcing of network operation
- Business solution
Flexible operation to best meet the user's requirements and the characteristics of the user's network
- Routing : IGP/EGP, multicast, IPv6, etc.
- Cover almost all layers (Layer 1, 2, 3, 4)
- Server maintenance : DNS, Web, Mail, etc.
- Negotiations with external networks
- Management of network resources : IP address, VLAN, rack space
- Monitoring
- Security
- Network consulting
- Face-to-face communication
Data Center operation - Model -
[Figure: data center operation model showing the user network, data center, NOC, KDDI network, external NOC, IX, and the Internet, with the internal and external NOC tasks: entering/leaving, power supply, security, monitoring, resource management, negotiation/cooperation, routing/traffic tuning]
Professional management & operation
APAN & JGN2
Location:
- The NOC is located on the 12th floor of the KDDI Otemachi Bldg. in Tokyo, with equipment installed on the 5th floor of the same building.
Staff:
- Operators on standby 24×7
- Operators are also in charge of operations for other networks
  - Scientific, academic, commercial ISP
Duties:
- Opening and closing of trouble tickets
- Receiving problem reports
- Trouble-shooting
- Development and maintenance of measurement and operation tools
Professional management & operation
APAN JP Site NOC
[Figure: NOC layout. 12F: operation staff, OpenView NNM, mail & web client. 5F: APAN equipment. Hubs connect the KDDI and APAN segments; the KDDI Circuit Division runs the physical layer monitor]
- HP OpenView works independently in the NOC segment
- NOC staff use mail & web clients to detect alerts
- The physical layer monitor system of KDDI observes circuits. When any alerts are detected, they are issued concurrently at the KDDI Circuit Division.
Network Operation in line with Network Characteristics
Commercial ISP backbone
Stability and reliability are important above all
- Redundant configuration is indispensable for trouble avoidance and load distribution of equipment
- Although the network scale is large, the network design is simple
Monitoring mainly the connectivity at the L2/L3 level
- It is difficult to grasp each user's flow at the application level
- But it is very important to check the trend of end-to-end communication
Substantial operation manual
- Since there is a lot of equipment, the operation manual must thoroughly cover the management of equipment and network composition
- Since there are many operators, the operation policy must be observed uniformly
Operation of hierarchical networks
- Each edge / access / backbone network has its own best-suited operational policy and system
Quick notice of trouble
- SLA (Service Level Agreement) : guarantee of the notification time, e.g. within 30 minutes
Network Operation in line with Network Characteristics
R&D Networks 1
Network performance and high-speed bandwidth are required
- Must support high-speed applications where a single user uses from several tens of Mbps up to 10 Gbps of throughput
- Flexibly provide a high-performance network for every experiment or demonstration
Allocation of network resources based on the operators' view
- Coordinate so that high-speed demonstrations are not performed simultaneously
Maintenance for physical and logical configuration changes is performed frequently
- In response to user demand, we have to change the configuration, because the amount of equipment is limited
Network operation range is wide
- Managing not only the backbone but also near-the-end hosts is required
Network Operation in line with Network Characteristics
R&D Networks 2
Test-bed operation of advanced technology and new equipment
- We have actually provided vendors with problem reports on Juniper & Procket routers
- New operation and trouble-shooting methods are always being sought
Disclose operation information as much as possible
- Researchers and other NOC operators can check the network operation status
- Collected operation data stimulates network research
Troubles causing long outages are noticeable
- There are only a few environments where equipment is installed in a housing site operated by a 24/7 NOC (especially in Asia)
Proposal for Improving Network Service Level
Shortening of trouble-handling time
- Start trouble-handling and announce the information quickly
  - Operation tools enable us to issue trouble tickets automatically and announce information quickly
- Shorten trouble-shooting time
  - Remote trouble-shooting from other areas (cf. Router Proxy on Abilene)
World-wide information sharing
- Installation of a shared information server providing the following information
  - Performance and operation status of the network
  - Trouble and maintenance information
Redundant network configuration
- Redundant configuration is very effective in realizing high availability. It is desirable to establish redundant configuration as much as possible.
Proposal for Improving Network Service Level
Operation of lower layers
- For operation, it is very important to check the status of circuits in cooperation with circuit carriers
- As a recent trend, backbone networks based on L2 or Lambda are conspicuous
Layer 2
- Difficulty in finding bottlenecks
- Apply L3 monitoring technology, e.g. ICMP ping, traceroute, other measurement tools
- VLAN ID management from end to end
Lambda
- Operators can't monitor and measure the performance of the circuit/link
- The burden of operation falls on the end router/user
Discussion & Question
Today's Assignment
1. Describe, as concretely as possible, a monitoring method suitable for each of the following.
1-1. Connectivity to a PC in the internal network
1-2. WWW service on a web server
1-3. Promptly detecting down/up of an interface on an Ethernet switch
1-4. Traffic of a GbE interface on a router
1-5. Checking the share of P2P applications in the whole traffic
2. Describe the merits of housing equipment in a data center, in the following three categories.
2-1. Network connectivity
2-2. Environment
2-3. Operation
3. Give 1-2 lines of feedback