Network Monitoring and Data Center Operation


KDDI/APAN-JP/JGNⅡ

Jin [email protected]

Self-Introduction

Jin Tanaka [email protected]

KDDI (Japanese telecommunications carrier), Otemachi Technical Center

Network Operator
- Worked as a network engineer at a commercial ISP for 2 years
- Currently working as a network operator for research, development and education networks

APAN-JP NOC (AS7660)
- http://www.apan.net : APAN
- http://www.jp.apan.net : APAN-JP page
- http://www.jp.apan.net/noc : NOC page

JGN2 international NOC
- http://www.jgn.nict.go.jp/ : JGN2

Agenda

1. Basic Knowledge of Network Monitoring
2. Network Monitoring Tools
3. Advanced Tools for Measuring Network Performance
4. Data Center Operation
5. Discussion & Question

Basic Knowledge of Network Monitoring

Why is Network Monitoring Necessary?

Network reliability is becoming more and more important
- Networks are now lifelines and business infrastructure: mission critical

Trouble is inevitable on any network
- With current IP technology, it is difficult to build a network that never fails!

To shorten downtime
- Detect trouble at an early stage
- Complete troubleshooting quickly

To grasp the state of the network
- Availability, performance, routing

What is Monitoring 1: Basic Ways of Monitoring

Monitoring is classified into three ways
1. Monitoring in the internal network (most cases)
2. Monitoring via an external network
   - via a peering network
   - via the Internet
3. Independent, non-network access (emergency case)
   - ISDN, PSTN

[Figure: monitoring machine, internal network, external network]

What is Monitoring 2: Scheme of Monitoring

1. Determine the monitoring target (what is to be monitored?)
2. Set up the monitoring node
   - Ping to target, SNMP polling
3. Establish the threshold
   - Ping / polling interval, SNMP MIB
4. Threshold exceeded
5. Issue the alert
   - Sound, mail, pop-up

Monitoring is a repetition of the above flow. Troubleshooting starts when the alert in step 5 is judged to be real trouble!
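The five-step flow can be sketched as a small polling loop. This is only a minimal illustration, not the tooling used in the slides: the `probe` callable, the `notify` callback, the helper `scripted_probe`, and the choice of "N consecutive missed polls" as the threshold are all assumptions introduced here.

```python
from typing import Callable, List


def run_monitor(probe: Callable[[], bool],
                polls: int,
                failure_threshold: int,
                notify: Callable[[str], None]) -> int:
    """Poll a target `polls` times and alert when the threshold is exceeded.

    probe  -- hypothetical check (e.g. a ping wrapper) returning True when the target answers
    notify -- hypothetical alert sink (sound, mail, pop-up ...)
    Returns the number of alerts raised.
    """
    consecutive_failures = 0
    alerts = 0
    for _ in range(polls):
        if probe():
            consecutive_failures = 0      # target is alive, reset the counter
            continue
        consecutive_failures += 1
        # Step 4 (threshold exceeded) leads to step 5 (alert), once per outage
        if consecutive_failures == failure_threshold:
            notify(f"target missed {failure_threshold} consecutive polls")
            alerts += 1
    return alerts


def scripted_probe(results: List[bool]) -> Callable[[], bool]:
    """Build a fake probe that replays canned results, for trying the loop offline."""
    it = iter(results)
    return lambda: next(it)
```

A real deployment would sleep for the polling interval between iterations; that is left out so the sketch can be run instantly with `scripted_probe`.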

Determination of Monitoring Target

Select targets that are suitable for checking that the network service is operating normally. What is the target for monitoring?

Router
- Dead or alive? Status? Performance? Routing?

Server
- Dead or alive? Status? Daemon? Service port?

Traffic, etc.
- Increase or decrease? DoS attack? Performance? Environment?

Monitoring Method 1

Examine how to monitor the target

Active monitoring or passive monitoring
- Polling: the monitoring machine sends a message to the watched target
  - Useful for checking the current status
  - ICMP/SNMP polling…
- Receiving trap messages from the target
  - Useful for detecting status changes
  - SNMP trap, syslog…
- Statistics data
  - Useful for grasping trends and transitions

Select the monitoring tool
- Ping (ICMP), SNMP, monitoring tool, original tool, etc.

Check the monitoring route to the target
- Internal or external network

Monitoring Method 2

Examine the frequency of monitoring
- Monitor the target on a case-by-case basis or on a regular basis
- Is it necessary to monitor regularly using a monitoring tool or system?
  - Targets critical to providing the network service
  - Statistics data useful for troubleshooting
- Determine the monitoring interval
  - 5/15/30 … minutes, 1/8/24 … hours, …

Establish the threshold for alerts
- Required for generating an alert from a quantitative change

The best monitoring is done in an environment similar to actual service conditions!

Notification of Alert

How to notify alerts
Selecting a suitable alert notification function is indispensable
- Graphical
  - Pop-up message, flashing icon on display
- Mail
  - Sent regularly, for checking the condition
  - Sent only when there is a state change
- Sound

An alert has no meaning if operators do not notice the network trouble!

Network Monitoring Tools

- ICMP/Ping Polling 1 -

Check IP reachability by ICMP echo / echo reply

Additional information
- RTT (Round Trip Time)
- Packet loss
- TTL (Time to Live)

The most standard way of checking node activity
Time-series RTT and packet-loss data become important information when measuring link performance

[Figure: ICMP echo and ICMP echo reply, yielding RTT: xx msec, Packet Loss: xx %, TTL: xx]

- ICMP/Ping Polling 2 -

Optional parameters

In daily operation
- Packet size (bytes)
- Sending interval (sec)
- Sending count (n)
- Timeout (sec)
- TTL (n)
- Pattern (0x????)
- etc.

In the monitoring system
- Sending interval
- Sending count
- Timeout

Set values suited to the critical level or service level!

UDP/TCP polling

Effective for monitoring the service ports of a server
- Using a client for the service
  - DNS - nslookup
- Using telnet
  - WWW, SMTP, POP
- Using a tool
  - RADIUS - radping

[Figure: telnet to the service port, then reply]

bash-2.05$ telnet ns.jp.apan.net 80
Trying 203.181.248.3...
Connected to ns.jp.apan.net.
Escape character is '^]'.
get
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>501 Method Not Implemented</title>
:
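The telnet check above only needs the TCP connection to succeed, so it can be automated with a plain socket. The following is a minimal sketch using Python's standard library; the function name `tcp_port_open` and the 3-second default timeout are assumptions chosen for illustration.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection performs the TCP handshake, like "telnet host port"
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and name-resolution failures
        return False
```

For example, `tcp_port_open("ns.jp.apan.net", 80)` would correspond to the telnet session shown on the slide. A connection that succeeds only proves that the port accepts connections; checking the reply, as the manual "get" does, is a further step this sketch leaves out.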

SNMP - Framework -

SNMP: Simple Network Management Protocol
- Polling: UDP 161, Trap: UDP 162
- Protocol for monitoring and managing equipment via the network
- Enables us to monitor the state and traffic of various equipment without depending on the vendor
- Management is realized over UDP between:
  - Manager: the monitoring/managing server, e.g. HP OpenView, Sun NNM
  - Agent: resides in the network equipment, e.g. Unix daemon, Cisco IOS

The most general technique for acquiring detailed information from a router or a switch

SNMP - Version -

SNMP v1 (RFC 1157)
- When the Manager sends a request, the Agent returns a response
- The Agent sends a trap when a specific event has occurred

SNMP v2 (RFC 1902)
- Basic features are almost the same as those of v1
- Additional specifications
  - 64-bit counters: can deal with large numerical values
  - GetBulk request: used to efficiently retrieve large blocks of data
  - Supports encryption of messages

SNMP v3 (RFC 2271 to 2275)
- Additional specifications
  - Various security functions: MD5 user authentication, DES encryption
  - Dynamic configuration of the SNMP Agent using SNMP SET commands

SNMP - MIB & OID -

The SNMP Manager can acquire from the Agent the management information defined by the MIB (Management Information Base)
- Current version: MIB-II (RFC 1213)
- A MIB is the aggregate of objects (information) on the equipment that the SNMP Agent holds
- An identifier is defined for each object = OID
- MIBs implemented by an Agent are roughly divided into:
  - MIB-II: standard, public, specified by the IETF
  - Enterprise MIB: private, specified by the vendor

SNMP - MIB Tree -

Objects are managed in a tree. An object is expressed as a row of values separated by periods.

root
  ccitt(0)
  iso(1)
    org(3)
      dod(6)
        internet(1)
          directory(1)
          mgmt(2)
            mib-2(1)        : standard MIBs
          experimental(3)
          private(4)
            enterprises(1)  : vendor-specific MIBs
  joint-iso-ccitt(2)

SNMP - OID -

OID expression
iso(1).org(3).dod(6).internet(1).mgmt(2).mib-2(1) -> .1.3.6.1.2.1
e.g. sysDescr = .1.3.6.1.2.1.1.1 = mib-2.1.1 = system.1

Subtree name | OID | Description
system | 1.3.6.1.2.1.1 | Defines a list of objects that pertain to system operation, such as the system uptime, system contact, and system name.
interfaces | 1.3.6.1.2.1.2 | Keeps track of the status of each interface on a managed entity. The interfaces group monitors which interfaces are up or down and tracks such things as octets sent and received, errors and discards, etc.
at | 1.3.6.1.2.1.3 | The address translation (at) group is deprecated and is provided only for backward compatibility. It will probably be dropped from MIB-III.
ip | 1.3.6.1.2.1.4 | Keeps track of many aspects of IP, including IP routing.
icmp | 1.3.6.1.2.1.5 | Tracks things such as ICMP errors, discards, etc.
tcp | 1.3.6.1.2.1.6 | Tracks, among other things, the state of the TCP connection (e.g., closed, listen, synSent, etc.).
udp | 1.3.6.1.2.1.7 | Tracks UDP statistics, datagrams in and out, etc.
egp | 1.3.6.1.2.1.8 | Tracks various statistics about EGP and keeps an EGP neighbor table.
transmission | 1.3.6.1.2.1.10 | There are currently no objects defined for this group, but other media-specific MIBs are defined using this subtree.
snmp | 1.3.6.1.2.1.11 | Measures the performance of the underlying SNMP implementation on the managed entity and tracks things such as the number of SNMP packets sent and received.
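The dotted notation is simply the path of arc numbers from the root of the tree, so an OID can be assembled mechanically. The sketch below hard-codes the mib-2 prefix and the group numbers from the table above; the names `MIB2_PREFIX`, `MIB2_GROUPS` and `group_oid` are illustrative and do not come from any SNMP library.

```python
# Arc numbers of iso.org.dod.internet.mgmt.mib-2, as given on the slide.
MIB2_PREFIX = (1, 3, 6, 1, 2, 1)

# Top-level mib-2 groups and their sub-identifiers, from the table above.
MIB2_GROUPS = {
    "system": 1, "interfaces": 2, "at": 3, "ip": 4, "icmp": 5,
    "tcp": 6, "udp": 7, "egp": 8, "transmission": 10, "snmp": 11,
}


def group_oid(group: str, *sub_ids: int) -> str:
    """Return the dotted OID of a mib-2 group, optionally extended by sub-identifiers."""
    arcs = MIB2_PREFIX + (MIB2_GROUPS[group],) + tuple(sub_ids)
    return "." + ".".join(str(arc) for arc in arcs)
```

With this, `group_oid("system", 1)` gives the sysDescr OID `.1.3.6.1.2.1.1.1` quoted in the example, and further sub-identifiers can be appended for table columns and instances.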

SNMP - SNMP Message -

SNMP version: identifies the version of SNMP (0 is for version 1)
Community: password shared between Manager and Agent
PDU (Protocol Data Unit): the actual command

Manager -> Agent
- GetRequest
  - Used to request the values of one or more MIB variables
- GetNextRequest
  - Used to read the values of variables in the MIB sequentially. It is often used to read through a table of values: after the first row has been read with a GetRequest, GetNextRequest messages are used to read through the remaining rows
- SetRequest
  - Used to update one of the MIB values

Agent -> Manager
- GetResponse
  - Returned as the answer to a GetRequest or GetNextRequest message
- Trap
  - Used to notify significant events (e.g. a cold or a warm restart…)
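GetNextRequest relies on the lexicographic ordering of OIDs: the Agent answers with the first object that comes after the OID in the request. The toy model below illustrates that idea with an in-memory dictionary standing in for an Agent; it does not speak SNMP, and the names `parse_oid`, `get_next`, `walk` and `toy_mib` are assumptions made for this sketch.

```python
from typing import Dict, Iterator, Optional, Tuple

Oid = Tuple[int, ...]


def parse_oid(text: str) -> Oid:
    """Convert '.1.3.6.1' or '1.3.6.1' into a tuple of integers."""
    return tuple(int(part) for part in text.strip(".").split("."))


def get_next(mib: Dict[Oid, str], oid: Oid) -> Optional[Tuple[Oid, str]]:
    """Return the first (oid, value) strictly after `oid` in lexicographic order."""
    later = sorted(key for key in mib if key > oid)
    if not later:
        return None                       # end of the MIB view
    return later[0], mib[later[0]]


def walk(mib: Dict[Oid, str], root: Oid) -> Iterator[Tuple[Oid, str]]:
    """Repeat GetNext from `root` until the returned OID leaves the subtree."""
    current = root
    while True:
        step = get_next(mib, current)
        if step is None or step[0][:len(root)] != root:
            return
        yield step
        current = step[0]


# A toy Agent holding three instances (values are illustrative).
toy_mib = {
    parse_oid("1.3.6.1.2.1.1.1.0"): "m20 internet router",   # sysDescr.0
    parse_oid("1.3.6.1.2.1.1.5.0"): "tpr2",                  # sysName.0
    parse_oid("1.3.6.1.2.1.2.2.1.4.136"): "9192",            # ifMtu.136
}
system_rows = list(walk(toy_mib, parse_oid("1.3.6.1.2.1.1")))
```

Here `system_rows` contains only the two entries under the system group. The subtree root itself holds no value, which is why a plain Get on a subtree name fails while a walk built on GetNext returns its contents.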

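SNMP - SNMP Message Handling 1 -

[Figure: message exchange between SNMP Manager and SNMP Agent]
- Manager -> Agent: GetRequest (What is the value of MIB?)
- Agent -> Manager: GetResponse (The value is XXXX!)
- Manager -> Agent: GetNextRequest (What is the next value of MIB Tree?)
- Manager -> Agent: SetRequest (Modify the value of OID)
- Agent -> Manager: Trap (Problem happened!)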

SNMP - SNMP Message Handling 2 -

Command examples

GetRequest
inetapan@tools:~> snmpget -v2c -c xxxx tpr2.jp.apan.net .1.3.6.1.2.1.2.2.1.4.136
IF-MIB::ifMtu.136 = INTEGER: 9192

GetNextRequest
inetapan@tools:~> snmpget -v2c -c xxxx tpr2.jp.apan.net system
SNMPv2-MIB::system = No Such Object available on this agent at this OID
inetapan@tools:~> snmpwalk -v2c -c xxxx tpr2.jp.apan.net system
SNMPv2-MIB::sysDescr.0 = STRING: m20 internet router, kernel 6.2R3.10
SNMPv2-MIB::sysObjectID.0 = OID: SNMPv2-SMI::enterprises.2636.1.1.1.2.2
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (423280751) 48 days, 23:46:47.51
SNMPv2-MIB::sysContact.0 = STRING:
SNMPv2-MIB::sysName.0 = STRING: tpr2
SNMPv2-MIB::sysLocation.0 = STRING:
SNMPv2-MIB::sysServices.0 = INTEGER: 4

SetRequest
inetapan@tools:~> snmpget -v2c -c xxxx tppr.jp.apan.net system.sysLocation.0
system.sysLocation.0 = ""
inetapan@tools:~> snmpset -v2c -c yyyy tppr.jp.apan.net system.sysLocation.0 s "Tokyo, JP"
system.sysLocation.0 = "Tokyo, JP"
inetapan@tools:~> snmpget -v2c -c xxxx tppr.jp.apan.net system.sysLocation.0
system.sysLocation.0 = "Tokyo, JP"

SNMP - Trap Message -

The way for an Agent to inform the Manager that something undesirable has happened
- A trap originates from the Agent and is sent to the trap destination configured within the Agent itself
- When the Manager receives a trap, it needs to know how to interpret it

PDU
- Enterprise: vendor identification (OID) for the agent
- AgentAddress: the IP address of the node where the trap was generated
- Trap Type: Generic / Specific (not used)
- Timestamp: the length of time between the last re-initialization of the agent that issued the trap and the moment at which the trap was issued

Monitoring Software - HP OpenView -

HP OpenView Network Node Manager ®
http://www.openview.hp.com/products/nnm/index.html

Overview
- Auto discovery and mapping
- Drill-down views (hierarchy map)
- Fault monitoring: ICMP / SNMP polling
- Event monitoring: trap receiving / event configuration
- SNMP tools: status polling, MIB browser
- Web-based reports
- A rich set of extension software
- Platform: Windows 2000/XP, Solaris 8/9, HP-UX

APAN-JP NOC monitors its network mainly with OpenView!

Monitoring Software - HP OpenView Sample 1 -

[Figure: OpenView screenshots showing the event log, ICMP polling for connectivity check, network map, router map and network sub-map]

Monitoring Software - HP OpenView Sample 2 -

[Figure: OpenView tools. Screenshots of SNMP configuration for polling (parameters, community), event configuration, and data collection & thresholds for SNMP]

Monitoring Software - Nagios Overview -

Nagios ®
Freely available from http://www.nagios.org

Overview
- A host and service monitor designed to inform you of network and end-system problems
- Provides simple ping availability checks of resources on the network
- Works with a set of "plugins" to provide local and remote host service status
- Custom "plugins" are relatively easy to develop
- Web-based monitoring system
- Platform: Linux, UNIX

APAN-JP NOC uses Nagios as its secondary monitoring system!

Monitoring Software - Nagios Sample 1 -

[Figure: Nagios screenshots of "Service Overview For All Host Groups" and "Service Status Details For All Hosts"]

Monitoring Software - Nagios Sample 2 -

[Figure: Nagios screenshots of "Network Map For All Hosts" and the event log]

MRTG (Multi-Router Traffic Grapher)

Overview
- Monitors the load of network equipment using SNMP; mainly used for creating traffic graphs
- Excellent graphing tool developed by Tobias Oetiker
- Plots any two variables against time; the graph is rendered as a PNG image on an HTML page
- Scripts can be written to feed data into MRTG
- Implements data collection, image generation and web-page generation
- Very widely deployed in large networks and still being actively developed
- Platform: UNIX systems / Windows NT
- Supports SNMPv2: able to read 64-bit counters
- http://people.ee.ethz.ch/~oetiker/webtools/mrtg/

MRTG - Workflow -

Display of graph
- Green area typically represents incoming maximum bits per second
- Blue line typically represents outgoing maximum bits per second

Workflow
1. Read the configuration file
2. Collect graphing data from network equipment, based on the configuration
3. Update the database file and generate the graph
4. If required, generate the HTML file

MRTG performs the above workflow once and then exits. Since MRTG collects the data of the past 5 minutes (the default value in the source code), it is desirable to run it from "crontab" every 5 minutes

MRTG - Data Storage -

[Figure: sample graphs. Daily graph (5-minute data), weekly graph (30-minute data), monthly graph (2-hour data), yearly graph (1-day data)]

Data storage
- Keeps 5-minute data only for about 2 days (600 records × 5 minutes = 50 hours). The data is thrown away afterward.
  - Historical data cannot be referred to at high resolution
- Keeps 1-day data for approx. 2 years

Graph | Interval | Num of records | Storage period
Daily | 5 minutes | 600 | about 2 days
Weekly | 30 minutes | 600 | 12.5 days
Monthly | 2 hours | 600 | 50 days
Yearly | 1 day | 731 | 2 years

Resolution becomes rougher from the top row to the bottom row.

MRTG - Configuration 1 -

MRTG configuration

cfgmaker
- Helps to create the configuration file

Example
cfgmaker --global 'WorkDir: /home/httpd/html/mrtg' \
  --global 'Options[_]: bits,growright' \
  --output /home/httpd/html/mrtg/cfg/mrtg.cfg \
  [email protected]

- Graph & log data: /home/httpd/html/mrtg
- Configuration file: /home/httpd/html/mrtg/cfg/mrtg.cfg
- Options: unit = bits (bps), horizontal axis grows to the right

Detailed information
http://people.ee.ethz.ch/~oetiker/webtools/mrtg/cfgmaker.html

MRTG - Configuration 2 -

Target configuration

Target expression
Target[<target name>]: <target kind>:<community>@<address>
- <target name>: identifies the equipment
- <target kind>: measurement item
- <community>: SNMP community string
- <address>: hostname or IP address of the equipment

SNMP data collection specification methods
- Basic / port (ifIndex)
  Target[myrouter]: 2:[email protected]
- Explicit OIDs / MIB variables
  Target[myrouter]: 1.3.6.1.2.1.2.2.1.14.1&1.3.6.1.2.1.2.2.1.20.1:public@myrouter
  Target[myrouter]: ifInErrors.1&ifOutErrors.1:public@myrouter

You can use cfgmaker to generate interface references with the option --ifref=?
- ifref=ip: interface by IP address
- ifref=descr: interface by description
- ifref=name: interface by name
- ifref=eth: interface by Ethernet address

MRTG - Configuration 3 -

Example of configuration

Target[la]: ifHCInOctets\so-2/0/0&ifHCOutOctets\so-2/0/0:[email protected]:::::2
MaxBytes[la]: 300000000
Title[la]: Traffic Analysis of TransPAC LA Link
PageTop[la]: <H1>Traffic Analysis of TransPAC LA link</H1>
WithPeak[la]: ymw
Directory[la]: tpr2
Options[la]: bits, growright

Target[la-err]: ifInErrors\so-2/0/0&ifOutErrors\so-2/0/0:[email protected]
MaxBytes[la-err]: 300000000
Title[la-err]: Packet Error for TransPAC LA link
PageTop[la-err]: <H1>Packet Error for TransPAC LA link</H1>
Directory[la-err]: tpr2
Options[la-err]: growright, integer, nopercent
YLegend[la-err]: Number of Error Packets
ShortLegend[la-err]: n
Legend1[la-err]: Number of Error Packets for Incoming Traffic
Legend2[la-err]: Number of Error Packets for Outgoing Traffic
Legend3[la-err]: Peak of Number of Error Packets for Incoming Traffic
Legend4[la-err]: Peak of Number of Error Packets for Outgoing Traffic
LegendI[la-err]: In:
LegendO[la-err]: Out:
WithPeak[la-err]: w

MRTG - Comments -

Comments / Disadvantages
- If you are to monitor a lot of devices (thousands), it is better to have a fast disk
- If using external monitoring scripts, a fast processor and a lot of memory are necessary
- Not particularly fast when compared to other data retrieval and storage schemes (flat text files can slow down processing)
- MRTG can't customize graphing periods
- Flat text files are difficult to process when scripting against the data
- Use 64-bit counters with SNMPv2 for OC3 to OC192 and GbE interfaces: about 115 Mbps of traffic can wrap a 32-bit counter in 5 minutes
- MRTG can't modify collected data once it has been summarized
- Only two variables are available in a graph
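The 115 Mbps figure follows from the size of a 32-bit octet counter. The short calculation below reproduces it; the function name and the assumption of a constant traffic rate are illustrative simplifications.

```python
def seconds_to_wrap(counter_bits: int, traffic_bps: float) -> float:
    """Seconds until an octet counter of `counter_bits` bits wraps at a constant bit rate."""
    max_octets = 2 ** counter_bits          # the counter counts bytes (octets)
    octets_per_second = traffic_bps / 8.0
    return max_octets / octets_per_second


wrap_32_at_115m = seconds_to_wrap(32, 115e6)   # just under 300 s, about one 5-minute poll
wrap_32_at_1g = seconds_to_wrap(32, 1e9)       # well under a minute on a saturated GbE link
```

When the counter can wrap more than once between two polls, the difference between successive readings no longer reflects the traffic, which is why 64-bit counters are recommended for fast interfaces.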

RRDtool (Round Robin Database Tool)

Overview
- Successor to MRTG
- Developed by the developer of MRTG: Tobias Oetiker
- A tool set around the RRD that can flexibly define data items, time intervals, data amounts, graph depiction, etc.
- Binary file format that can store data at any interval for any length of time
  - The file does not grow in size over time
- Ability to make custom graphs across user-defined intervals
- Ability to graph multiple variables on a single graph
- Additional scripts are necessary for creating graphs and web pages
- 25-30 percent faster than MRTG
- Does not have a function to collect data
- http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/

RRDtool - Architecture -

Comparison of architecture between MRTG and RRDtool

[Figure: architecture diagrams with the labels router, server, SNMP engine, frontend program, text log, RRD, graph and index]

RRDtool - Basic Usage -

Basic usage of RRDtool
1. Set up a new Round Robin Database (RRD)
   - Define the RRD used as the vessel for data
   - Command: rrdtool create filename
2. Store a new set of values into the RRD periodically
   - Write the data collected by the frontend program into the RRD
   - Command: rrdtool update filename
3. Generate a graph
   - Create a graph from data stored in one or several RRDs
   - Command: rrdtool graph filename (specify the name of the graph to generate)

[Figure: (1) the RRD is created, (2) data is fed into it repeatedly, (3) a graph is generated from it]

RRDtool - Practice -

Example
- Object: Gigabit Ethernet switch
- Definition: definition of RRD records
  - Ability to draw peak graphs from data covering 1 day to 10 years

Graph | Interval | Num of RRD records | Storage period
4 hours | 1 minute | 360 | 6 hours
Daily | 5 minutes | 576 | 2 days
Monthly | 2 hours | 600 | 50 days
Yearly | 1 day | 731 | 2 years
10 years | 4 days | 915 | 10 years

RRDtool - Create -

Set up a new Round Robin Database (RRD)

DS: define the data item
- COUNTER: continuously increasing counter
- 60: if no new data is supplied for more than 60 sec, the value is considered "unknown"
- 0: minimum acceptable value (bytes)
- 125000000: maximum acceptable value (bytes)

RRA (Round Robin Archive): define the data consolidation
- AVERAGE / MAX: average / maximum of the consolidated data
- 0.5: the fraction of a consolidation interval that may be made up of *UNKNOWN* data while the consolidated value is still regarded as known
  - AVERAGE: 50%, MAX: 20% or 10%
- 1: number of data points consolidated into one value that then goes into the archive
- 360: how many generations of data values are kept in the RRA

Command example

/usr/local/rrdtool-1.0.46/bin/rrdtool create \
  /home/httpd/html/traffic/traffic_vlan.rrd \
  --step 60 \
  DS:vlan2in:COUNTER:60:0:125000000 \
  DS:vlan2out:COUNTER:60:0:125000000 \
  DS:vlan7in:COUNTER:60:0:125000000 \
  DS:vlan7out:COUNTER:60:0:125000000 \
  :
  RRA:AVERAGE:0.5:1:360 \
  RRA:AVERAGE:0.5:5:576 \
  RRA:AVERAGE:0.5:120:600 \
  RRA:AVERAGE:0.5:1440:731 \
  RRA:AVERAGE:0.5:5760:915 \
  RRA:MAX:0.2:5:576 \
  RRA:MAX:0.1:120:600 \
  RRA:MAX:0.1:1440:731 \
  RRA:MAX:0.1:5760:915
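The resolution and retention of each archive follow directly from the step and the two numbers at the end of each RRA definition. The calculation below is a sketch of that arithmetic for the AVERAGE archives in the command example; the names `STEP_SECONDS`, `RRAS` and `rra_coverage` are illustrative.

```python
STEP_SECONDS = 60  # the --step value in the command example

# (steps per consolidated point, rows kept) for the five AVERAGE archives
RRAS = [(1, 360), (5, 576), (120, 600), (1440, 731), (5760, 915)]


def rra_coverage(step_seconds: int, steps: int, rows: int):
    """Return (resolution in seconds, retention in days) for one RRA."""
    resolution = step_seconds * steps           # time span of one stored value
    retention_days = resolution * rows / 86400.0
    return resolution, retention_days


coverage = [rra_coverage(STEP_SECONDS, steps, rows) for steps, rows in RRAS]
```

The results are 1-minute data for 6 hours, 5-minute data for 2 days, 2-hour data for 50 days, 1-day data for 731 days and 4-day data for 3660 days, which matches the intervals and storage periods listed for this example.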

RRDtool - Update -

Store a new set of values into the RRD periodically

Data collection
- Collect the data from targets using a frontend program
  - Original tool
  - Cricket - http://cricket.sourceforge.net/
  - Orca - http://www.orcaware.com/orca/
  - SNAPP - http://sourceforge.net/projects/snapp/

Updating an RRD
- Feed the collected data into an RRD using the following command

Command example
rrdtool update /home/httpd/html/traffic/traffic_vlan.rrd \
  --template in:out N:11222:1

- /home/httpd/html/traffic/traffic_vlan.rrd: the name of the RRD you want to update
- in:out (DS1:DS2): the data sources, as defined in the RRD
- N: the update time is set to the current time

RRDtool - Graph 1 -

Generating a Graph -1-

Command example
rrdtool graph /home/httpd/html/traffic/traffic.png -s -4h -w 800 -h 800 -a PNG \
  -t "VLAN Traffic" -v "bit/s" \
  DEF:vlan2in_ave=/home/httpd/html/traffic/traffic_vlan.rrd:vlan2in:AVERAGE \
  DEF:vlan2out_ave=/home/httpd/html/traffic/traffic_vlan.rrd:vlan2out:AVERAGE \
  DEF:vlan7in_ave=/home/httpd/html/traffic/traffic_vlan.rrd:vlan7in:AVERAGE \
  DEF:vlan7out_ave=/home/httpd/html/traffic/traffic_vlan.rrd:vlan7out:AVERAGE \
  CDEF:vlan2in_ave_bit=vlan2in_ave,8,* \
  CDEF:vlan7in_ave_bit=vlan7in_ave,8,* \
  CDEF:vlan2out_ave_bit=vlan2out_ave,-8,* \
  CDEF:vlan7out_ave_bit=vlan7out_ave,-8,* \
  AREA:vlan2in_ave_bit#ff5e5e:VLAN2-in \
  STACK:vlan7in_ave_bit#5eff5e:VLAN7-in \
  AREA:vlan2out_ave_bit#aa0101:VLAN2-out \
  STACK:vlan7out_ave_bit#0101aa:VLAN7-out

Options
- -s: start time (default unit: seconds)
- -e: end time (default unit: seconds)
- -w, -h: width and height in pixels
- -a: image format GIF|PNG
- -t: graph title
- -v: vertical-label text

RRDtool - Graph 2 -

Generating a Graph -2-

DEF
- Define a virtual name for a data source
- DEF:<vname>=<RRD filename>:<DS-name>:<CF>
- CF: consolidation function; select AVERAGE, MAX, MIN or LAST (newest data)

CDEF
- Create a new virtual data source by evaluating a mathematical expression
- CDEF:<vname>=<rpn-expression> (Reverse Polish Notation)

Graph depiction parameters
<style>:<vname>#<color>:<legend>
- LINE: plot the requested data as a line, using the color specified
- AREA: the area between the 0 line and the graph line is filled with the color specified
- STACK: the graph is stacked on top of the previous LINE, AREA or STACK graph

By regenerating the graph periodically using "crontab", you can see updated graphs on the Web
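A CDEF expression is written in Reverse Polish Notation: operands are pushed onto a stack and each operator consumes the two most recent values. The evaluator below is a minimal sketch of that idea supporting only the four arithmetic operators; it is not RRDtool's implementation, and RRDtool's RPN offers many more operators than shown here.

```python
def eval_rpn(expression: str, variables: dict) -> float:
    """Evaluate a comma-separated RPN expression with +, -, * and / operators."""
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a / b,
    }
    stack = []
    for token in expression.split(","):
        token = token.strip()
        if token in ops:
            b = stack.pop()               # right operand was pushed last
            a = stack.pop()
            stack.append(ops[token](a, b))
        elif token in variables:
            stack.append(float(variables[token]))
        else:
            stack.append(float(token))    # numeric literal such as 8 or -8
    if len(stack) != 1:
        raise ValueError("malformed RPN expression")
    return stack[0]
```

For instance, `eval_rpn("vlan2in_ave,8,*", {"vlan2in_ave": 1000.0})` multiplies a byte rate by 8 to obtain bits per second, and using `-8` instead flips the sign so that outgoing traffic can be drawn below the zero line.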

RRDtool - Sample -

Sample Graph

http://mrtg.jp.apan.net/cricket/router-interfaces/

Advanced Tools for Measuring Network Performance

Iperf - Overview -

Iperf is used to measure TCP and UDP bandwidth performance
- Tool to measure maximum TCP bandwidth, allowing the tuning of various parameters and UDP characteristics
- Performs "memory to memory" transfers so that disk I/O does not influence the results
- Client and server can have multiple simultaneous connections
- Supports IPv6
- Platform: UNIX systems / Windows / Mac OS
- Effective for investigating circuit quality when a new circuit is established
- http://dast.nlanr.net/Projects/Iperf/

Iperf - Mode -

[Figure: client and server]

TCP mode
- Measures bandwidth
- Reports MSS (Maximum Segment Size) / MTU (Maximum Transmission Unit) size and observed read sizes
- Supports setting the TCP window size via socket buffers

UDP mode
- Client can create UDP streams of a specified bandwidth
- Measures packet loss, delay, jitter

Since Iperf actually generates traffic, it must be operated with care!

Iperf - Example -

Test result example

[Figure: traffic]

test% iperf -u -i1 -s
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size: 1.00 MByte (default)
------------------------------------------------------------
[ 3] local 203.181.249.xxx port 5xxx connected with 203.181.248.xx port 32781
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 4] 0.0- 1.0 sec 61.0 MBytes 511 Mbits/sec 0.006 ms 0/43492 (0%)
[ 4] 1.0- 2.0 sec 61.0 MBytes 511 Mbits/sec 0.005 ms 0/43479 (0%)
[ 4] 2.0- 3.0 sec 61.0 MBytes 511 Mbits/sec 0.005 ms 1/43478 (0.0023%)
[ 4] 3.0- 4.0 sec 61.0 MBytes 511 Mbits/sec 0.007 ms -1/43478 (-0.0023%)
[ 4] 4.0- 5.0 sec 61.0 MBytes 511 Mbits/sec 0.004 ms 0/43478 (0%)
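The Transfer, Bandwidth and loss columns of this report can be cross-checked from the datagram counts. The arithmetic below is a sketch that assumes MBytes are counted in units of 2^20 bytes and Mbits in units of 10^6 bits, which is the combination that reproduces the figures shown; the function names are illustrative.

```python
DATAGRAM_BYTES = 1470      # "Receiving 1470 byte datagrams"


def udp_interval_stats(datagrams: int, seconds: float,
                       payload_bytes: int = DATAGRAM_BYTES):
    """Return (MBytes transferred, Mbits/sec) for one reporting interval."""
    total_bytes = datagrams * payload_bytes
    mbytes = total_bytes / (1024 * 1024)             # assumed unit: 2**20 bytes
    mbits_per_sec = total_bytes * 8 / seconds / 1e6  # assumed unit: 10**6 bits
    return mbytes, mbits_per_sec


def loss_percent(lost: int, total: int) -> float:
    """Percentage of datagrams lost in one interval."""
    return 100.0 * lost / total


mbytes, mbps = udp_interval_stats(43478, 1.0)
loss = loss_percent(1, 43478)
```

Rounded to the precision of the report, these give 61.0 MBytes, 511 Mbits/sec and 0.0023 % for the third interval.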

BWCTL (Bandwidth Control)

- BWCTL is a resource allocation and scheduling daemon for arbitrating iperf tests
- The BWCTL client application works by contacting a bwctld process on both endpoints of the test
- Requires NTP to be running to synchronize the system clocks
- Open mode: everyone can use it
- Authentication mode: requires exchanging an AES key
- Supports IPv6
- Platform: UNIX systems
- Developed by Internet2: http://e2epi.internet2.edu/bwctl/

Previously, users attempting to run bandwidth tests could not be certain whether their test was scheduled in a time frame in which no other tests were running

OWAMP (One-way Active Measurement Protocol)

- OWAMP is a command-line client application and a policy daemon used to determine one-way latencies between hosts
- It makes it possible to collect active measurement data
  - e.g. one-way delay, packet loss, jitter
- NTP must be set up correctly on the system to calculate a reasonable estimate of the time error and to stabilize the clock
- Supports IPv6
- Platform: UNIX systems
- Current draft: draft-ietf-ippm-owdp-10.txt
- Developed by Internet2: http://e2epi.internet2.edu/owamp/

Round-trip-based measurement cannot identify the delay in each direction, especially when asymmetric routes are used
- ICMP ping: RTT
- owping: one-way

OWAMP - Protocol -

Consists of two inter-related protocols
- OWAMP-Control: used to initiate, start and stop test sessions, and to fetch test results
- OWAMP-Test: defines the format of the probe packets

Sample measurement data
- http://pe2.koganei.wide.ad.jp/cgi-bin/owd-stat
- http://qpe.jp.apan.net/cgi-bin/owd-stat

Netflow - Overview -

Overview
- Enables IP traffic flow analysis without probes
- Invented and patented by Cisco
  - Juniper (called cflowd), Foundry and many other vendors support it
- Flow cache data on routers is exported to a flow tool, where the traffic flows are analyzed

Flow definition
- Source IP address
- Destination IP address
- Source port
- Destination port
- Layer 3 protocol type
- TOS byte (DSCP)
- Input logical interface (ifIndex)

[Figure: routers in the core network with NetFlow enabled carry the traffic and send UDP NetFlow export packets to a collector (Solaris, HP-UX, or Linux), which feeds an application GUI]
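Packets that share all seven key fields belong to the same flow, and the flow cache accumulates counters per key. The sketch below models that grouping in memory; the class `FlowKey`, the function `aggregate` and the sample addresses are illustrative assumptions, and real flow caches also expire and export entries, which is not modeled here.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class FlowKey(NamedTuple):
    """The seven fields that define a flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int
    tos: int
    input_ifindex: int


def aggregate(packets: Iterable[Tuple[FlowKey, int]]) -> Dict[FlowKey, Dict[str, int]]:
    """Accumulate packet and byte counts per flow key."""
    cache: Dict[FlowKey, Dict[str, int]] = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for key, size in packets:
        cache[key]["packets"] += 1
        cache[key]["bytes"] += size
    return dict(cache)


# Two packets of one TCP flow and one packet of a separate UDP flow.
web = FlowKey("192.0.2.1", "198.51.100.7", 40000, 80, 6, 0, 3)
dns = FlowKey("192.0.2.1", "198.51.100.53", 40001, 53, 17, 0, 3)
flows = aggregate([(web, 1500), (dns, 80), (web, 40)])
```

The result holds two flows: the TCP flow with 2 packets and 1540 bytes, and the UDP flow with 1 packet and 80 bytes.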

Netflow - Flow Data -

Flow data export
- Enable NetFlow on the router
- There are differences in architecture between Cisco and Juniper routers
- Take care that the load on the router does not become high!
  - Check CPU, memory, bandwidth, sampling rate

Flow data collection & analysis
- Prepare the software for receiving flow-export data
  - flow-tools http://www.splintered.net/sw/flow-tools/
  - cflowd http://www.caida.org/tools/measurement/cflowd/
  - Cisco: NetFlow Collector
- Analyze traffic from the raw data with software
  - FlowScan http://net.doit.wisc.edu/~plonka/FlowScan/ (to graph the analysis data, RRDtool is recommended)
  - Cisco: CiscoWorks

Exported flow data includes
- Source and destination IP address
- Source and destination TCP/UDP ports
- Packet and byte counts
- Routing information (next-hop address, source autonomous system (AS) number, destination AS number, source prefix mask, destination prefix mask)

Netflow - Example -

[Figure: NetFlow example]

Observatory - Overview -

Observatory projects
- Abilene Observatory http://abilene.internet2.edu/observatory/
  - Abilene backbone http://abilene.internet2.edu/
- APAN Observatory http://www.jp.apan.net/NOC/Observatory/

A system that collects network performance data on the backbone
- Collected data can be used for operation and research
- APAN is preparing three types of data, collected across the JP-US link and shared publicly
  - Latency data: using OWAMP
  - Netflow data: using NetFlow (Juniper, Procket & flow-tools)
  - Throughput data: using Iperf (BWCTL)

[Figure: links between APAN Tokyo XP, Los Angeles and Chicago/Indianapolis; average RTT 190 ms]

Observatory - Formation -

The Observatory system will help R&D networks grow!

Observatory
- Developed by NOC researchers and maintained by NOC engineers/operators
- Common tools with high priority
- Output data in standard format

[Figure: Observatory diagram with the labels NOC Basic Service (helpful in operation), NOC Advanced Service, data with authentication, researchers, and software, papers, etc.]

Researchers can get measurement data of the global networks and collaborate with foreign researchers.

Introduction of other advanced tools

Abilene Router Proxy - Overview -

- Similar to Looking Glass, but with some advanced functions
- A web form allows users to submit various commands to backbone routers
- Allows remote network operators to troubleshoot problems without contacting the NOC
- Unix-based
- Uses scripted telnet to log in to the routers and grab the output
- Not designed for high-speed access to backbone information
- Very useful operation tool for inter-domain networks
- Enables us to view the operational situation of almost all Abilene routers

http://ratt.uits.iu.edu/routerproxy/abilene/

Abilene Router Proxy - Sample -

[Figure: Router Proxy sample screen]

Summary

In summary, the table below ranks each tool according to four core criteria (1: lowest, 5: highest)

Tool | Accessible | Useful for operation | Useful for troubleshooting | Low cost
OpenView | 3 | 4 | 5 | 2
Nagios | 4 | 2 | 4 | 5
MRTG | 4 | 5 | 4 | 5
RRD | 3 | 4 | 5 | 5
NetFlow | 3 | 4 | 5 | 4
OWAMP | 2 | 3 | 5 | 5
Iperf (BWCTL) | 3 | 4 | 5 | 5
Router Proxy | 4 | 5 | 5 | 5

Data Center Operation

Data Center operation - Service -

1. Circuit service
   - Leased circuit, ATM/SONET/Ethernet/VPN
2. Housing/co-location service
   - Rack co-location, open co-location
   - Security considerations
     - Security camera, security entry system
3. Site management service
   - Basic service
     - Check entering/leaving, check the power supply and air conditioning, check equipment lamps, power equipment off/on
   - Monitoring service
     - Ping monitoring, service port monitoring, log monitoring, etc.
   - Report service
     - Traffic graph report (MRTG), resource report, etc.
   - Routine work carried out on the user's behalf
     - Tape change, scheduled equipment reboot, etc.
   - Assistance in troubleshooting
     - Technical support over the telephone, detection of trouble points, etc.

Data Center operation - Service 2 -

4. Professional management & operation
   - Outsourcing of network operation
   - Business solution
     - Flexible operation to best meet the user's requirements and the characteristics of the user's network
   - Routing: IGP/EGP, multicast, IPv6, etc.
   - Covers almost all layers (Layer 1, 2, 3, 4)
   - Server maintenance: DNS, Web, Mail, etc.
   - Negotiations with external networks
   - Management of network resources: IP addresses, VLANs, rack space
   - Monitoring
   - Security
   - Network consulting
   - Face-to-face communication

Data Center operation - Model -

[Figure: data center operation model. Labels: User Network, Data Center, the Internet, IX / the Internet, NOC, KDDI Network, External NOC, Internal, External; tasks shown: entering/leaving, power supply, monitoring, negotiation/cooperation, routing/traffic tuning, resource management, security]

Professional management & operation - APAN & JGN2 -

Location:
- The NOC is located on the 12th floor of the KDDI Otemachi Building in Tokyo, with equipment installed on the 5th floor of the same building

Staff:
- Operators on standby 24×7
- Operators are also in charge of operations for other networks
  - Scientific, academic, commercial ISP

Duties:
- Opening and closing of trouble tickets
- Receiving problem reports
- Troubleshooting
- Development and maintenance of measurement and operation tools

Professional management & operation - APAN JP Site NOC -

[Figure: NOC layout. Labels: KDDI Circuit Division, operation staff, OpenView NNM, mail & web client, physical layer monitor, KDDI, APAN, hubs, 12F, 5F, APAN equipment]

- HP OpenView works independently in the NOC segment
- NOC staff use mail and web clients to detect alerts
- The KDDI physical layer monitor system observes circuits. When alerts are detected, they are issued concurrently at the KDDI Circuit Division.

Network Operation in line with Network Characteristics

Commercial ISP backbone

Stability and reliability are important above all
- Redundant configuration is indispensable for trouble avoidance and load distribution of equipment
- Although the network scale is large, the network design is simple

Mainly monitoring connectivity at the L2/L3 level
- It is difficult to grasp each user's flows at the application level
- But it is very important to check the trend of end-to-end communication

Substantial operation manual
- Since the equipment is extensive, the operation manual must cover the management of equipment and network composition in depth
- Since there are many operators, the operation policy must be observed uniformly

Operation of hierarchical networks
- Each edge / access / backbone network has its own best-suited operational policy and system

Quick notice of trouble
- SLA (Service Level Agreement): guarantee of the notice time, e.g. within 30 minutes

R&D Networks 1

Network performance and high-speed bandwidth are required
- Must support high-speed applications where one user uses from several tens of Mbps up to 10 Gbps of throughput
- Flexibly provide a high-performance network for every experiment or demonstration

Allocation of network resources based on the operators' view
- Coordinate so that high-speed demonstrations are not performed simultaneously

Maintenance for physical and logical configuration changes is performed frequently
- In response to user demand, we have to change the configuration because the scale of equipment is limited

Network operation range is wide
- Managing not only the backbone but also near-the-end hosts is required

R&D Networks 2

Test bed operation of advanced technology and new equipment
- We have actually provided vendors with problem reports on Juniper and Procket routers
- New operation and troubleshooting methods are always being sought

Disclose operation information as much as possible
- Researchers and other NOC operators can check the network operation situation
- Collected operation data stimulates network research

Troubles causing long outages are noticeable
- There are only a few environments where equipment is installed in a housing site operated by a 24/7 NOC (especially in Asia)

Proposal for Improving Network Service Level

Shortening of trouble-handling time
- Start trouble handling and announce the information quickly
  - Operation tools enable us to issue trouble tickets automatically and announce information quickly
- Shorten troubleshooting time
  - Remote troubleshooting from other areas (cf. Router Proxy on Abilene)

Worldwide information sharing
- Installation of a shared information server providing the following information
  - Performance and operation status of the network
  - Trouble and maintenance information

Redundant network configuration
- Redundant configuration is very effective in realizing high availability. It is desirable to establish redundant configuration as much as possible.

Proposal for Improving Network Service Level

Operation of lower layers
- For operation, it is very important to check the status of circuits in cooperation with circuit carriers
- As a recent trend, backbone networks based on L2 or lambda are conspicuous

Layer 2
- Difficulty in finding bottlenecks
- Apply L3 monitoring technology, e.g. ICMP ping, traceroute, other measurement tools
- End-to-end VLAN ID management

Lambda
- Operators can't monitor and measure the performance of the circuit/link
- The burden of operation falls on the end router/user

Discussion & Question

Today's Assignment

1. Describe, as concretely as possible, a monitoring method suitable for each of the following.
   1-1. Connectivity to a PC in the internal network
   1-2. WWW service on a web server
   1-3. Promptly detecting down/up of an interface on an Ethernet switch
   1-4. Traffic of a GbE interface on a router
   1-5. Checking the share of P2P applications in the whole traffic
2. Describe the merits of housing equipment in a data center in the following three categories.
   2-1. Network connectivity
   2-2. Environment
   2-3. Operation
3. Give 1-2 lines of feedback
