
22 Mar 2005 v0.4

Finding Network Problems that Influence Applications: Measurement Tools

Internet2 Performance Workshop

Presenter
Presentation Notes
Introduce yourself. --- These slides were prepared by Matt Zekauskas, using original material inspired by Matt Mathis, NLANR DAST and experience; example material created by Rich Carlson and Russ Hobby, with information from the e-VLBI project at MIT Haystack Observatory. Copyright © 2004, Internet2. All Rights Reserved, except that permission is expressly granted for others to use in noncommercial educational materials as long as attribution is given to Internet2 and the authors. [[Replace with blessed text]]

Outline

Problems, typical causes, diagnostic strategies
Examples showing usage of the tools we’ll be talking about today
End-to-End Measurement Infrastructure


We Would Like Your Help

What problems are you experiencing?

Have you used a good tool?

Give us the benefit of your experience: successful problem resolution!

Presenter
Presentation Notes
Before we really start, I want to say that we don’t have all the answers. I’m going to tell you about common problems, and tools and techniques we’ve found useful. You’ll learn more about specific tools over the course of the day. However, if you have a common problem, or a particularly difficult problem, we’d like to hear about it. In fact, we collect “war stories” for publication on our web site. In addition, if you have a tool or technique that we don’t talk about today, please do speak up during the day, or send us details after the workshop.

What Are The Problems? (1)

Packet loss
Jitter
Out-of-order packets (extreme jitter)
Duplicated packets
Excessive latency
• Interactive applications
• TCP’s control system

Presenter
Presentation Notes
I’ll start off with a list of network problems that we find affect the performance of most applications.
Packet loss slows TCP (bulk data transfer), and causes dropouts with voice and “jaggies” with video.
Jitter, or variation in the rate at which packets arrive, can also cause TCP to slow down, or at least react to problems more slowly; excessive jitter in a real-time application can cause some packets to be treated as if they were lost, causing dropouts and video problems. If you have an interactive application, say remote control of a scanning-tunneling microscope, jitter makes it hard for humans to react. We can learn how to deal with latency, but we can’t adjust for arbitrary changes in latency. An Ohio State study of H.323 codecs found that jitter caused more problems than loss (up to some point).
Out-of-order packets can be viewed as extreme jitter; besides the problems already listed, many applications are not written well (for example, early MPEG-2 and HDTV codecs), and out-of-order packets can cause them more problems than lost packets. Applications should be written to be tolerant of out-of-order packets (since parallelism in the network can generate them naturally, and they actually occur quite frequently), but for now, reducing the number of out-of-order packets will improve application efficiency.
Duplicate packets waste bandwidth, and in extreme cases can cause TCP to slow down or confuse real-time applications.
Excessive latency makes things difficult for interactive applications (video conferencing, remote instrument control), although humans can compensate to some extent. TCP’s control system also has more trouble as latency increases, since it reacts more slowly. In general, it is best to engineer paths so that latency is minimized.
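Loss, duplication, and reordering as described in these notes can be counted from the sequence numbers a receiver sees. The following is a minimal Python sketch, assuming the sender numbers packets 0, 1, 2, ... and sends each exactly once. The function name and the sample trace are illustrative and do not come from any tool covered in this workshop.

```python
from collections import Counter

def classify_arrivals(seqs):
    """Summarize loss, duplication, and reordering from arrival-ordered sequence numbers.

    Assumes the sender transmitted 0..max(seqs) once each, so losses after the
    highest number received cannot be detected here.
    """
    counts = Counter(seqs)
    expected = set(range(max(seqs) + 1)) if seqs else set()
    lost = sorted(expected - set(counts))
    duplicates = sorted(s for s, n in counts.items() if n > 1)

    # A packet counts as out of order if it first arrives after a higher-numbered one.
    reordered = []
    highest = -1
    seen = set()
    for s in seqs:
        if s in seen:
            continue  # repeats are already reported as duplicates
        seen.add(s)
        if s < highest:
            reordered.append(s)
        highest = max(highest, s)
    return {"lost": lost, "duplicates": duplicates, "reordered": reordered}

# Hypothetical trace: packet 4 never arrives, packet 2 arrives late and twice.
print(classify_arrivals([0, 1, 3, 2, 2, 5]))
```

A real-time application would typically apply a deadline as well, treating a packet that arrives too late as lost, which is why excessive jitter shows up as dropouts.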

For TCP

Eliminating loss is the goal
Non-congestive losses especially tricky
TCP: 100 Mbit Ethernet coast-to-coast:
• Full size packets… need 10^-6 Ploss [Mathis]
• Less than 1 loss every 83 seconds

http://www.psc.edu/~mathis/papers/JTechs200105/

GigE: 10^-8, 1 loss every 497 seconds

Presenter
Presentation Notes
Let’s look at how “vanilla” Reno TCP (still the most commonly deployed TCP stack) reacts to losses, to see how important it is to limit loss along a path. If a path is congested, it is obvious (at least once the problem link is found) because link utilization is high. However, losses can have other causes (which we will get to in a moment), and these “non-congestive” losses are especially hard to track down. They are also the ones that are important to eradicate as much as possible.
Let’s say our goal is modest (for modern workstations): send 100 megabits per second from coast to coast. With full size Ethernet packets (1500 bytes for 100Mbps interfaces) you need a probability of packet loss on the order of one in a million. That’s one loss every 83 seconds. [[Note: this is an equation to give you a general idea; exact semantics depend on the version of TCP that is deployed; a single loss may get covered by “fast retransmit”, and the sender may never slow down.]] How about if you have gigabit Ethernet? Then the loss probability must be on the order of ten to the negative eighth power (one in a hundred million), or one loss every 497 seconds.
The situation gets better if you can use so-called “jumbo frames”, or 9000 byte packets: it goes back to close to the 100 megabit case. That’s one reason to try to make high-performance paths “9000-byte clean”. The situation also gets better with some of the newer TCP algorithms (high-speed TCP, BIC TCP, FAST TCP), which is why there’s a lot of research into new bulk-transfer control algorithms. But we would still like a way to find and remove non-congestive losses as much as possible.
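To get a feel for why the tolerable loss rate shrinks so quickly, here is a rough Python sketch based on the widely cited simplified Reno model from Mathis et al., rate ≤ (MSS/RTT) · C/√p. The 70 ms round-trip time, the MSS values, and the constant C = √(3/2) are assumptions chosen for illustration. The sketch gives order-of-magnitude results only and is not expected to reproduce the slide’s 83 and 497 second figures, which rest on the referenced paper’s own assumptions.

```python
import math

MATHIS_C = math.sqrt(3 / 2)  # assumed constant; varies with ACK strategy and loss model

def required_loss_probability(rate_bps, mss_bytes, rtt_s, c=MATHIS_C):
    """Largest loss probability p for which (MSS/RTT) * c / sqrt(p) >= rate."""
    return (mss_bytes * 8 * c / (rtt_s * rate_bps)) ** 2

RTT = 0.070  # seconds; assumed coast-to-coast round-trip time

for label, rate_bps, mss in [
    ("100 Mbps, 1460-byte MSS", 100e6, 1460),
    ("1 Gbps, 1460-byte MSS", 1e9, 1460),
    ("1 Gbps, 8960-byte MSS (jumbo frames)", 1e9, 8960),
]:
    p = required_loss_probability(rate_bps, mss, RTT)
    print(f"{label}: p <= {p:.1e}")
```

Because p scales with the square of MSS/rate, a tenfold increase in rate costs a factor of 100 in tolerable loss, and moving to jumbo frames wins back much of that, consistent with the point made in the notes.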

What Are The Problems? (2)

TCP: lack of buffer space
• Forces protocol into stop-and-wait
• Number one TCP-related performance problem.

• 70ms * 1Gbps = 70*10^6 bits, or 8.4MB
• 70ms * 100Mbps = 855KB
• Many stacks default to 64KB, or 7.4Mbps

Presenter
Presentation Notes
OK, we talked about network problems. However, you should also know that the number one reason for TCP not running at “full speed” is that it is starved for buffer space. Vendors ship TCP stacks with buffers that are tuned for the commercial Internet. If the buffer is too small, TCP, which uses a “sliding window” for flow control, must wait for packets to be acknowledged in order to advance the window and send more data. Essentially the sender is forced to stop and wait. You need to be able to buffer the number of bits you can send in one round-trip time at your desired speed. [[Note that there are also send and receive buffers; if either is too small you can end up with this problem.]] For example, with a 70 millisecond round-trip time (more-or-less trans-continental North America), to sustain one gigabit per second you need 8.4 megabytes of buffer space. For 100Mbps at the same distance you need 855 kilobytes. Many stacks default to 64 kilobytes, which only allows 7.4 Mbps. One word of caution: network kilobits, megabits, and gigabits are powers of 10. Memory kilobytes and megabytes are powers of two, a kilobyte being 1024 bytes (2^10) and a megabyte being 1,048,576 bytes (2^20).
More detail on TCP behavior from Rich: TCP has two buffers, send and receive. The sliding window is always used, not just if the buffer is too small. Since TCP delivers a reliable, in-order packet delivery service, it needs to detect and recover from lost and mis-ordered packets. The sender must retain a copy of all packets sent in case IP packets are lost. If this buffer fills up, the sender must stop sending until ACKs are received. The receiver must also deal with out-of-order packets, so it maintains a reassembly buffer. This buffer also holds packets when the application is unable to process them. If this receive buffer fills up, the sender must stop sending until the application can process the data. So either buffer filling up can cause the sender to stop.
Both sender and receiver must be tuned to eliminate buffer stalls. In some cases it may not be possible for the local host or sys-admin to fix the problem (e.g., the remote host has mis-set buffers).
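The buffer sizing rule in these notes (hold one round-trip time’s worth of data at the target rate) is the bandwidth-delay product. A minimal Python sketch of the arithmetic follows; the function names are illustrative. It follows the caution above by treating network rates as powers of ten and memory sizes as powers of two, and its results should land within rounding distance of the slide’s figures.

```python
def buffer_bytes_needed(rate_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight during one RTT at the target rate."""
    return rate_bps * rtt_s / 8

def max_rate_bps(buffer_bytes, rtt_s):
    """Best sustained rate when at most one buffer of data can be sent per RTT."""
    return buffer_bytes * 8 / rtt_s

RTT = 0.070  # seconds; the trans-continental round trip used in the notes

for rate_bps in (1e9, 100e6):
    need = buffer_bytes_needed(rate_bps, RTT)
    print(f"{rate_bps / 1e6:.0f} Mbps needs {need / 1024:.0f} KB "
          f"({need / 1024**2:.2f} MB) of buffer")

default_buffer = 64 * 1024  # bytes; the common 64 KB default
print(f"64 KB buffer allows about {max_rate_bps(default_buffer, RTT) / 1e6:.1f} Mbps")
```

The same two functions apply to either the send or the receive buffer, since whichever is smaller sets the limit.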

What Are The Problems? (3)

Video/Audio: lack of buffer space
• Makes broadcast streams very sensitive to previous problems

Application behaviors
• Stop-and-wait behavior; Can’t stream
• Lack of robustness to network anomalies

Presenter
Presentation Notes
I also want to mention that the same problem carries up to the applications themselves. We won’t be speaking more about this today, but for video and audio (streaming media) the lack of buffer space in the application (in our world, MPEG-2 based applications are especially bad) means the application is very sensitive to packet loss or reordering. Of course, if your application is interactive, then increased buffering can lead to lag in response, which is not desirable, either. This generalizes to bad network application behavior, so that they are not robust to network changes or anomalies. Drops will occur. Reordering will occur. Even if only very occasionally. Even applications that would like to use TCP to do bulk transfer can do things like not hand enough data to TCP to allow it to stream over long distances. One that was brought to light recently is scp (and therefore ssh; this problem can also occur with standard FTP); popular versions of scp do not provide large enough buffers for TCP to stream. (There is a pointer to a good version of scp off a tcp tuning page at PSC mentioned later.)

The Usual Suspects

Host configuration errors (TCP buffers)
Duplex mismatch (Ethernet)
Wiring/Fiber problem
Bad equipment
Bad routing
Congestion
• “Real” traffic
• Unnecessary traffic (broadcasts, multicast, denial of service attacks)

Presenter
Presentation Notes
So what causes these problems? Here’s a laundry-list of the “usual suspects”. First on the list, and most common is a bad host configuration. As we just mentioned, this is usually because operating systems ship tuned to the commercial internet, and we have very different paths over the Internet2 infrastructure (in particular the “bandwidth delay product” is much greater). Second is duplex mismatch, usually due to autoconfiguration failure, with one side believing it is full-duplex (can send and receive simultaneously), and the other side believing it is half-duplex (can only send or receive one at a time). This is a legacy of how the Ethernet standard has evolved. This is the major cause of “non-congestive” packet losses. Wiring or fiber problems can cause non-congestive packet losses. Bad equipment (anything from host interfaces that cannot run full-speed, to host, switch, router, or fiber equipment failure) can cause excessive delays, jitter, or non-congestive packet loss. Bad routing can cause excessive latency, or sometimes jitter due to multiple different length paths being used. Congestion causes varying delays and packet loss.

Strategy

Most problems are local…
Test ahead of time!
Is there connectivity & reasonable latency? (ping -> OWAMP)
Is routing reasonable (traceroute)
Is host reasonable (NDT; Web100)
Is path reasonable (iperf -> BWCTL)

Presenter
Presentation Notes
OK. This slide illustrates one way to attack a performance problem. First and foremost, if you are planning a demo, or other event, do test ahead of time. If you have a concerned application community, this may mean periodic testing among points close to key equipment. For example, all of the VLBI sites may test among each other. It may also mean periodic testing within your network to points in Abilene, or other campuses you talk to frequently.
Now say you have a problem that the periodic testing did not pick up (there are just too many paths to test them all). The first question: do you have connectivity and reasonable latency? Ping will give you round-trip times, assuming it isn’t blocked along the way. We’ll describe a tool, owamp, that measures one-way delay, which allows you to disambiguate problems that might occur asymmetrically: asymmetric routing, asymmetric traffic queuing, or a dirty fiber (since each fiber transmits light in one direction). Are you seeing many losses with these low-rate tests? If so, there’s something terribly wrong.
If the latency is not what you expect, there may be a routing problem. The best-known tool is traceroute, and you can use that to make sure the path looks reasonable: it goes through your campus, possibly through a gigapop, across Abilene and down to the other side in a reasonable fashion (not taking a scenic tour of the US, for example). Remember that you have to test in the opposite direction; the Abilene router proxy and traceroute servers can help.
Has the host been tuned? Is there potentially a duplex mismatch on one of the local Ethernet connections? Here, running NDT, also to be described today, can point out a series of common problems. NDT itself relies on web100, which instruments the Linux kernel.
You might consider installing a web100 machine (or using machines with web100 code); there are additional diagnostics you can run using the web100-provided variables, and the kernel itself is better “out of the box”: it can automatically tune buffers on some TCP connections. If routing looks reasonable, and the host is reasonable, you may have a problem in the path. (Large losses in the low rate tests also indicate path problems, assuming it isn’t a duplex mismatch problem, local congestion--perhaps a denial of service attack--or even broken network hardware on the end system.) Iperf is a tool to run synthetic TCP streams (memory-to-memory) between two machines. Bwctl is a tool that we will talk about today that adds authentication and scheduling to iperf, and allows you to test to multiple points, including midpoints within Abilene.

One Technique: Problem Isolation via Divide and Conquer

Presenter
Presentation Notes
And, for path problems, the best strategy is usually to “divide and conquer”; test to a midpoint, and see which side the problem is on, and then test to a midpoint on one side, until you’ve exhausted your midpoints and have localized the problem as much as you can. We’re working on tools to automate this process, but for now it’s manual. This picture shows testing to a bunch of points that you have access to.
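The manual procedure described here is essentially a binary search over the test points along the path. A minimal Python sketch follows, assuming a single faulty segment, an end-to-end test that fails, and a hypothetical `path_ok(a, b)` callback that runs a test between two points and reports whether it met the target. The host names are made up for illustration.

```python
def localize_fault(hosts, path_ok):
    """Return the adjacent pair of test points that brackets the faulty segment.

    hosts:   test points in path order (source first, destination last).
    path_ok: hypothetical callback; True if a test from a to b performs acceptably.
    """
    lo, hi = 0, len(hosts) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if path_ok(hosts[lo], hosts[mid]):
            lo = mid  # near half is clean, so the fault lies beyond the midpoint
        else:
            hi = mid  # fault lies between lo and the midpoint
    return hosts[lo], hosts[hi]

# Simulated path whose fault sits between "backbone-2" and "gigapop-b".
hosts = ["campus-a", "gigapop-a", "backbone-1", "backbone-2", "gigapop-b", "campus-b"]

def simulated_path_ok(a, b):
    return not (hosts.index(a) <= 3 and hosts.index(b) >= 4)

print(localize_fault(hosts, simulated_path_ok))
```

In practice the list contains only the midpoints you actually have access to, so the result is the smallest segment you can isolate rather than a single link.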

Outline

Problems, typical causes, diagnostic strategies
Examples showing usage of the tools we’ll be talking about today
End-to-End Measurement Infrastructure


Tool Examples

When to use NDT
• NDT in action at SC’04

When to use BWCTL
• BWCTL in action with e-VLBI

When to use OWAMP
• OWAMP in action with Abilene

Putting it all together


When to use NDT

When you want to know about last mile and host problems
When you want a quick and easy test to provide clues to possible problem causes
When you want to understand large segments of the path from the host viewpoint
When a user wants to test their own host

Presenter
Presentation Notes
Quickest of the tools to use
Client can be used on almost any host
Servers are almost always close to some point in the path in Internet2

Technique

Start by testing to the nearest NDT server from each end of the problem path

This will help you with a majority of problems

If both tests indicate good performance, test to a distant NDT server

If tests still indicate good performance, suspect a problem in the application, not the host or network.


SC’04 Real Life Example

Booth having trouble getting application to run from Amsterdam to Pittsburgh
Tests between Amsterdam SGI and Pittsburgh PC showed throughput limited to < 20 Mbps
Assumption is: PC buffers too small
Question: How do we set WinXP send/receive buffer?

Presenter
Presentation Notes
This begins a section on how NDT was used to find a problem in a real-life situation. SC is the annual Supercomputing conference; it was held in Pittsburgh, PA in November 2004. The slide gives a brief problem description and the sys admin’s assumption as to what was wrong. The question is what Rich Carlson was asked.

SC’04 Determine WinXP info

http://www.dslreports.com/drtcp

Presenter
Presentation Notes
Rich pointed the sys admin to this tool to check/set the Windows buffer size. Note: in Windows, TCP transmit buffer tracks receive buffer, so setting 1 sets them both. This is the output from a laptop computer.

SC’04 Confirm PC settings

DrTCP reported 16 MB buffers, but test program still slow, Q: How to confirm?

Run test to SCInet NDT server (PC has Fast Ethernet Connection)

• Client-to-Server: 90 Mbps
• Server-to-Client: 95 Mbps
• PC Send/Recv Buffer size: 16 Mbytes (wscale 8)
• NDT Send/Recv Buffer Size: 8 Mbytes (wscale 7)
• Reported TCP average RTT: 46.2 msec
– approximately 600 Kbytes of data in TCP buffer
• Min buffer size / RTT: 1.3 Gbps

Presenter
Presentation Notes
DrTCP reported these results when run on the sys admin’s XP system, but they didn’t believe them, as their test program kept saying 8K buffers were being used. The test was then run from the XP machine to the NDT server on the conference network. The report clearly shows that the PC buffers are being set to 16 MB. Also note that with that buffer size and the reported RTT, the connection should be able to sustain 1.3 Gbps, so the limit is the physical 100 Mbps Fast Ethernet interface in the PC.

SC’04 Local PC Configured OK

No problem found
Able to run at line rate
Confirmed that PC’s TCP buffers were set correctly

Presenter
Presentation Notes
Conclusions from test. Note that we now know that the PC is operating properly.

SC’04 Amsterdam SGI

Run test from remote SGI to SC show floor (SGI is Gigabit Ethernet connected).

Downloaded and built command line tool on SGI IRIX

• Client-to-Server: 17 Mbps
• Server-to-Client: 16 Mbps
• SGI Send/Recv Buffer size: 256 Kbytes (wscale 3)
• NDT Send/Recv Buffer Size: 8 Mbytes (wscale 7)
• Average RTT: 106.7 msec
• Min Buffer size / RTT: 19 Mbps

Presenter
Presentation Notes
Next, run the test from the remote SGI. This host was located at the admin’s home university in Amsterdam. They needed to download and build the command line client before this test could be done. Note that the SGI has a small (256KB) buffer and that this buffer limits the throughput to ~19 Mbps, which is in line with the measured results. Now we know that the SGI needs to be tuned to run over the transatlantic path.
The user was reluctant to make changes to the SGI network interface from the SC show floor.
The NDT client tool allows the application to change the buffer size (via the setsockopt() function call).
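For readers who have not used setsockopt() for this purpose, the following Python sketch shows how an application can request larger TCP buffers on its own socket without touching system-wide settings. It is an illustration of the general technique, not NDT’s code. The helper name is made up, and the operating system may clamp or adjust the requested size, so the sketch reads the values back.

```python
import socket

def make_tuned_socket(buffer_bytes):
    """Create a TCP socket and request the given send/receive buffer size.

    Buffers should be set before connect() so the window scale negotiated
    during the handshake can reflect them.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    return sock

requested = 2 * 1024 * 1024  # 2 MB, the SGI's maximum in this example
sock = make_tuned_socket(requested)
# The kernel may grant less (system limits) or report a different value.
print("send buffer granted:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
print("recv buffer granted:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
sock.close()
```

If the granted value is smaller than the request, the system-wide maximum is the limit, which is the situation the SGI ran into at 2 MB.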

SC’04 Amsterdam SGI (tuned)

Re-run test from remote SGI to SC show floor with –b # option.

• Client-to-Server: 107 Mbps
• Server-to-Client: 109 Mbps
• SGI Send/Recv Buffer size: 2 Mbytes (wscale 5)
• NDT Send/Recv Buffer Size: 8 Mbytes (wscale 7)
• Reported average RTT: 104 msec
• Min Buffer size / RTT: 153.8 Mbps

Presenter
Presentation Notes
This slide shows what happens when the user increased the SGI buffer size. Experiments showed that 2 MB was the maximum value the SGI would allow; going above this value would require a system configuration change. Notice that we now get 100 Mbps, which is good enough because the client is only connected via a FastE link. The admin now knows that the network path can support full line rate between the convention floor and the remote site in Amsterdam.
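The “Min buffer size / RTT” line in each of the three NDT reports can be checked by hand. A small Python sketch follows, assuming decimal units (1 Mbyte = 10^6 bytes, 1 Kbyte = 10^3 bytes), which is the interpretation that appears to reproduce the reported figures; the function name is illustrative.

```python
def window_limit_mbps(buffer_bytes, rtt_ms):
    """Upper bound on TCP throughput when one buffer of data is sent per RTT."""
    return buffer_bytes * 8 / (rtt_ms / 1000) / 1e6

# (label, smaller of the two endpoint buffers in bytes, reported RTT in msec)
runs = [
    ("PC to SCinet NDT server", 8_000_000, 46.2),
    ("Amsterdam SGI, default buffers", 256_000, 106.7),
    ("Amsterdam SGI, tuned buffers", 2_000_000, 104.0),
]

for label, buffer_bytes, rtt_ms in runs:
    print(f"{label}: {window_limit_mbps(buffer_bytes, rtt_ms):.1f} Mbps")
```

When this bound is far above the interface speed, as in the first and third runs, the buffers are not the bottleneck; when it is close to the measured throughput, as in the second run, they are.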

SC’04 Debugging Results

Team spent over 1 hour looking at Win XP config, trying to verify Buffer size

• 2 tools used gave different results

Single NDT test verified this in under 30 seconds

10 minutes to download and install NDT client on SGI

15 minutes to discuss options and run client test with set buffer option

Presenter
Presentation Notes
Observations and conclusions. The admin spent over an hour trying to figure out why the XP host wasn’t operating properly. The tools he was using either didn’t give the correct data, or the data wasn’t believed because the test results didn’t match the expected results. Highlight that a single NDT test showed that everything was working as configured. It took about 10 minutes to download and build the NDT command line client on the SGI. This was done remotely from the SC convention floor via an ssh terminal session. After running the test it took another 15 minutes or so to evaluate what it meant, what our options were, and how to tell if things would get better if the default buffer sizes were increased.

SC’04 Debugging Results

8 Minutes to find SGI limits and determine maximum allowable buffer setting (2 MB)

Total time 34 minutes to verify problem was with remote server’s TCP send/receive buffer size

Network path verified but Application still performed poorly until it was also tuned

Presenter
Presentation Notes
Brief summary of where the time went. Note that about ½ hour elapsed from the time the NDT tests began. Contrast this with the hour already spent looking for a non-existent XP config problem. Final note: even after the network path was verified, the user application failed to operate as expected. It was eventually tuned to perform better.

When to use BWCTL

You want to understand segments of the path
You want to know if each segment can handle flows of a specific size
You want to know parameters such as bandwidth, packet loss and latency
To help design or tune an application based on available performance

Presenter
Presentation Notes
Other reasons to use BWCTL:
You do not have access to the end hosts (so you can still get partial information)
BWCTL allows testing only between known entities (other allowed BWCTL servers)
BWCTL ensures that only one test at a time is done between servers
BWCTL servers will have been tuned to ensure that the results reflect the network performance
In order to use BWCTL you need to get access to intermediate servers.

Technique

Divide and Conquer!
Look for segments with performance less than that required by the application


e-VLBI Case Study

The e-VLBI project needed to move massive amounts of data between a number of sites around the world
They found that performance from some sites was only in the 1 Mbps range
They needed to understand why


e-VLBI test infrastructure

David Lapsley, one of the research engineers, established BWCTL servers at the sites of the project.

• Japan: Kashima Observatory
• Sweden: Onsala Observatory
• US: Haystack (BOS)

He performed a full mesh of tests between all of the servers


e-VLBI Results #1

They used Abilene nodes to divide the problem path
David found that there was considerable packet loss in the area of Haystack Observatory
Working with network folk from the area, the problem was isolated and resolved

Presenter
Presentation Notes
Infrastructure revealed problems. Abilene nodes were in the middle, so they can show which side of the path you should focus on. Problems are often easy (at least conceptually) to fix once you can focus on the problem segments, although a fix sometimes requires the application of funds (to upgrade equipment or links).

e-VLBI Results #2

For one site that was using the commodity Internet, only 1 Mbps was regularly seen
The application was changed to locate caching to reduce dependence on that site.


e-VLBI Regular Testing

They found the testing to be very useful in understanding the network status

They established a regular testing schedule
They established a web site for reporting the results

All researchers can check the network status
http://web.haystack.mit.edu/staff/dlapsley/tsev7.html

Presenter
Presentation Notes
[[Verify URL still valid]]

When to use OWAMP

Want baseline “heartbeat” information
Asymmetric routes can make problem location more difficult
OWAMP can provide detailed performance on one direction in the path
When you want to know precise latency information
Good for helping real-time applications


Why use OWAMP

It is very sensitive to minor network changes

• Route changes
• Packet queuing

It tells you about one-direction of the path

Presenter
Presentation Notes
It also gives a long-term baseline, so you can note trends and changes.

OWAMP Case Study: Queuing on Abilene

Tuesday, 2004-08-17, 16:05-16:20 UTC
That’s 11:05 to 11:20 EDT
Caltech to CERN performing 10GE throughput experiment
• Single adapter to date, PCI-X
• Theoretical limit of ~8.5 Gbps
• Practical limit closer to 7.5 Gbps
• Exactly what was tested at that time is unknown
“Worst 10” delay list had some larger than normal variances… to date, software issues

Presenter
Presentation Notes
Background on what we think was going on…

One Link’s History

The Denver to KSCY Link

Presenter
Presentation Notes
The minimum never varies – typical of a link, unless it is fully congested. The 95th percentile is rising, and occasionally the 50th percentile is rising too – so a “typical” packet will see some queuing. For research links, we ideally want to keep the 95th percentile equal to the minimum, unless very large tests are being run.
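The three statistics discussed in these notes (minimum, 50th percentile, 95th percentile) are easy to compute from a window of one-way delay samples. A minimal Python sketch follows; the sample values and the 1 ms queuing threshold are made up for illustration and are not taken from the Abilene data.

```python
import statistics

def delay_summary(delays_ms):
    """Return the minimum, median, and 95th percentile of one-way delays (msec)."""
    ordered = sorted(delays_ms)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {"min": ordered[0], "p50": statistics.median(ordered), "p95": cuts[94]}

def shows_queuing(summary, threshold_ms=1.0):
    """Flag a window whose 95th percentile sits well above the minimum delay."""
    return summary["p95"] - summary["min"] > threshold_ms

# Hypothetical window: most packets see the base delay, a tenth are queued.
samples = [10.0] * 90 + [45.0] * 10
summary = delay_summary(samples)
print(summary, "queuing suspected:", shows_queuing(summary))
```

The minimum approximates the propagation delay of the path, so the gap between it and the upper percentiles is a direct estimate of time spent in queues.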

What It Shows

Only paths that traverse DNVR>KSCY showed additional delay
Some delayed by ~ an extra 35 msec
Probable cause: the router started queuing packets, creating a small delay
It tells you that there is congestion on the link.
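The reasoning on this slide, that the delay must sit on the one link every delayed path shares, can be sketched as a set intersection over one-way paths. In this Python sketch, DNVR and KSCY come from the slide; every other node name and all of the paths are hypothetical, and the function names are illustrative.

```python
def links(path):
    """Directed hop pairs of a path, so A>B is distinct from B>A."""
    return set(zip(path, path[1:]))

def suspect_links(delayed_paths, normal_paths):
    """Links present on every delayed path and on no normal path."""
    common = set.intersection(*(links(p) for p in delayed_paths))
    for path in normal_paths:
        common -= links(path)
    return common

# Hypothetical one-way measurements.
delayed = [
    ["west-1", "DNVR", "KSCY", "east-1"],
    ["west-2", "DNVR", "KSCY", "east-2"],
]
normal = [
    ["east-1", "KSCY", "DNVR", "west-1"],  # reverse direction, no extra delay
    ["west-1", "DNVR", "south-1"],
]
print(suspect_links(delayed, normal))
```

Treating links as directed matters here, since one-way measurements can show queuing in one direction of a link while the reverse direction is clean.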


Example 2 – SCP file transfer

Bob and Carol are collaborating on a project. Bob needs to send a copy of the data (50 MB) to Carol every ½ hour. Bob and Carol are 2,000 miles apart. How long should each transfer take?

• 5 minutes?
• 1 minute?
• 5 seconds?


What should we expect?

Assumptions:
• 100 Mbps Fast Ethernet is the slowest link
• 50 msec round trip time

Bob & Carol calculate:
• 50 MB * 8 = 400 Mbits
• 400 Mb / 100 Mb/sec = 4 seconds
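Bob and Carol’s estimate is a best case: file size divided by the rate of the slowest link. A one-function Python sketch of the same arithmetic follows, using decimal megabytes as the slide does. The function name is illustrative, and the estimate ignores protocol overhead, TCP slow start, and encryption cost.

```python
def ideal_transfer_seconds(size_mb, bottleneck_mbps):
    """Best-case transfer time: size in megabytes over the slowest link's rate."""
    return size_mb * 8 / bottleneck_mbps

print(ideal_transfer_seconds(50, 100))  # 4.0
```

Any measured time well above this figure points at the host, the path, or the application rather than at raw link capacity.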


Initial SCP Test Results


Initial Test Results

This is unacceptable!
First look for network infrastructure problem
• Use NDT tester to examine both hosts


Initial NDT testing shows Duplex Mismatch at one end


NDT Found Duplex Mismatch

Investigating this it is found that the switch port is configured for 100 Mbps Full-Duplex operation.

•Network administrator corrects configuration and asks for re-test


Duplex Mismatch Corrected


SCP results after Duplex Mismatch Corrected


Intermediate Results

Time dropped from 18 minutes to 40 seconds.
But our calculations said it should take 4 seconds!
• 400 Mb / 40 sec = 10 Mbps
• Why are we limited to 10 Mbps?
• Are you satisfied with 1/10th of the possible performance?


Default TCP window settings


Calculating the Window Size

Remember Bob found the round-trip time was 50 msec
Calculate window size limit
• 85.3KB * 8 b/B = 698777 b
• 698777 b / .050 s = 13.98 Mbps

Calculate new window size
• (100 Mb/s * .050 s) / 8 b/B = 610.3 KB
• Use 1MB as a minimum
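The two calculations on this slide, plus the step from 610.3 KB up to 1 MB, can be written out as a short Python sketch. Here KB means 1024 bytes, which matches the slide’s numbers. Rounding up to the next power of two is an assumption made for this sketch; the slide only says to use 1 MB as a minimum.

```python
def window_limited_mbps(window_kb, rtt_s):
    """Throughput ceiling imposed by a TCP window of window_kb kilobytes."""
    return window_kb * 1024 * 8 / rtt_s / 1e6

def window_needed_kb(rate_mbps, rtt_s):
    """Window (in KB) needed to keep a link of rate_mbps busy for one RTT."""
    return rate_mbps * 1e6 * rtt_s / 8 / 1024

def round_up_power_of_two_kb(kb):
    """Smallest power-of-two size in KB that is at least kb (assumed policy)."""
    size = 1
    while size < kb:
        size *= 2
    return size

RTT = 0.050  # seconds, as measured by Bob

print(f"default window limit: {window_limited_mbps(85.3, RTT):.2f} Mbps")
needed = window_needed_kb(100, RTT)
print(f"window needed: {needed:.1f} KB, "
      f"rounded up: {round_up_power_of_two_kb(needed)} KB")
```

Setting the window somewhat larger than the computed minimum leaves headroom for RTT variation, at the cost of a little more memory per connection.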

Page 46: Finding Network Problems that Influence Applications ...

Resetting Window Value

Page 47: Finding Network Problems that Influence Applications ...

With TCP windows tuned

Page 48: Finding Network Problems that Influence Applications ...

Steps so far

Found and fixed Duplex Mismatch
•Network Infrastructure problem

Found and fixed TCP window values
•Host configuration problem

Are we done yet?

Page 49: Finding Network Problems that Influence Applications ...

SCP results with tuned windows

Page 50: Finding Network Problems that Influence Applications ...

Intermediate Results

SCP still runs slower than expected
•Hint: SCP uses internal buffers
•Patch available from PSC

Page 51: Finding Network Problems that Influence Applications ...

SCP Results with tuned SCP

Page 52: Finding Network Problems that Influence Applications ...

Final Results

Fixed infrastructure problem
Fixed host configuration problem
Fixed application configuration problem

•Achieved target time of 4 seconds to transfer 50 MB file over 2000 miles

Page 53: Finding Network Problems that Influence Applications ...

Why is it hard to Find/Fix Problems?

Network infrastructure is complex
Network infrastructure is shared
Network infrastructure consists of multiple components

Page 54: Finding Network Problems that Influence Applications ...

Outline

Problems, typical causes, diagnostic strategies
Examples showing usage of the tools we’ll be talking about today
End-to-End Measurement Infrastructure

Presenter
Presentation Notes
OK. So, we’ve talked about problems. Diagnostic strategies. And we showed the use of some of the tools to implement those strategies. I’d like to take a moment to revisit the end-to-end performance initiative vision, and talk a bit about what campuses can do.
Page 55: Finding Network Problems that Influence Applications ...

End-to-End Measurement Infrastructure Vision

Ongoing monitoring to test major elements, and end-to-end paths.

•Elements: gigaPoP links, peering, …
•Utilization
•Delay
•Loss
•Occasional throughput
•Multicast connectivity

Presenter
Presentation Notes
Add separate slide for application communities? Not yet, so say that one of the important sets of end-to-end paths is the set that a particular application community cares about. Nuclear physics sites. Medical sites. Astronomy sites.
Page 56: Finding Network Problems that Influence Applications ...

End-to-End Measurement Infrastructure Vision II

Many more end-to-end paths than can be monitored.
Diagnostic tools available on-demand (with authorization)

•Show routes
•Perform flow tests (perhaps app tests)
•Parse/debug flows (à la tcpdump or OCxMON with heuristic tools)

Presenter
Presentation Notes
But, because there are many many paths, there will have to be some ability to do tests on the fly. So make them available. In the long run, provide a tool to do most of it for you, and just hand back the results. Probably skip: This is not limited to just the US; we work with folks in Europe, Asia, and elsewhere. So there has to be a way to interoperate. That’s one of the things Internet2 is working on now.
Page 57: Finding Network Problems that Influence Applications ...

What Campuses Can Do

Export SNMP data
•I have an “Internet2 list”, can add you
•Monitor loss as well as throughput

Performance test point at campus edge
•Hopefully, the result of today’s workshop
•Possibly also traceroute “looking glass”
•Commercial (e.g., NetIQ) complements
•We have a master list

Presenter
Presentation Notes
So, what can you do? A few simple things. Make utilization data available, at least at the edge of your campus. Monitor not only utilization, but things that can cause losses… packet drop and error counters. Placing points at the edge of your campus will allow you to test ad-hoc from within your campus to the edge, and allow you to constantly monitor campus connectivity you think is important. (There will be some more use cases later; one possibility is just making sure your university to university traffic goes over the high-performance network--as long as it’s up). Thanks for listening to the overview.
Page 58: Finding Network Problems that Influence Applications ...

Strategy (references) (1)

See also
•http://e2epi.internet2.edu/ Look at stories, documents, tools
•http://e2epi.internet2.edu/ndt/ Pointer to the tool, and using it for debugging the last mile

Presenter
Presentation Notes
References for debugging strategies and application design. Flip through these quickly, they are here so participants can look later.
Page 59: Finding Network Problems that Influence Applications ...

Strategy (references) (2)

•http://www.psc.edu/networking/projects/tcptune/ How to tweak OS parameters (also scp pointer)

•http://www.ncne.org/research/tcp/ TCP debugging the detailed way

•http://dast.nlanr.net/Guides/WritingApps/ Tips for app writers

•http://dast.nlanr.net/Guides/GettingStarted And some checking to do by hand & debugging.

Page 60: Finding Network Problems that Influence Applications ...

www.internet2.edu

Page 61: Finding Network Problems that Influence Applications ...

Acknowledgements

The original presentation is by Matt Zekauskas, using ideas inspired by material from NLANR DAST, Matt Mathis, and others.
Copyright Internet2 2005, All Rights Reserved.

Presenter
Presentation Notes
Just because my name will probably come off the front if this becomes standard course material.
Page 62: Finding Network Problems that Influence Applications ...

22 Mar 2005 v0.4

Background: Detailed Tools Discussion

Page 63: Finding Network Problems that Influence Applications ...

Background: Tools Outline

Tools: First mile, host issues
Tools: Path issues
Tools: Others to be aware of
Tools within Abilene

Presenter
Presentation Notes
Let’s start with tools that check out the hosts, and the network connections near the hosts.
Page 64: Finding Network Problems that Influence Applications ...

Internet2 Detective

A simple “is there any hope” tool
•Windows “tray” application
•Red/green lights, am I on Internet2
•Multicast available
•IPv6 available

http://detective.internet2.edu/

Presenter
Presentation Notes
Very rudimentary (NOTE: Do not read any of the items on this page – flip through quickly)
Page 65: Finding Network Problems that Influence Applications ...

NLANR Performance Advisor

Geared for the naive user
Run at both ends, and see if a standard problem is detected.
Can also work with intermediate servers
http://dast.nlanr.net/Projects/Advisor

Presenter
Presentation Notes
I want to mention this one because it’s fresh. But we haven’t had time to extensively evaluate it. (NOTE: Do not read any of the items on this page – flip through quickly)
Page 66: Finding Network Problems that Influence Applications ...

NDT

Network Diagnostic Tool
Java applet
Connects to server in middle, runs tests, and evaluates heuristics looking for host and first mile problems.
Has detailed output.
You’ll see lots of detail later today.
A commercial tool that tests for TCP buffer problems: http://www.dslreports.com/tweaks/

Presenter
Presentation Notes
This is the one we’re actively developing. More on this later today. (NOTE: Do not read any of the items on this page – flip through quickly)
Page 67: Finding Network Problems that Influence Applications ...

Host/OS Tuning: Web100

Goal: TCP stack, tuning not bottleneck
Large measurement component

•TCP performance not what you expect? Ask TCP why!

–Receiver bottleneck (out of receiver window)
–Sender bottleneck (no data to send)
–Path bottleneck (out of congestion window)
–Path anomalies (duplicate, out of order, loss)

www.web100.org

Presenter
Presentation Notes
As an aside, I mentioned Web100 earlier in a bullet. Here’s what Web100 is, and why you might want to put it on systems you use, if you can. (NOTE: Do not read any of the items on this page – flip through quickly) KEEP THIS TEXT FOR FOLKS WHO DOWNLOAD THE SLIDES: It is a kernel modification, currently to Linux 2.6 series kernels. There is a TCP MIB draft in the IETF to try and standardize the export-TCP-state part of Web100, and we expect Microsoft and others to pick that up. (Microsoft already has some of the elements in recent Windows server versions)
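
One way to picture "ask TCP why" is to compare how long a connection spent in each limited state. The sketch below is illustrative only: the three inputs are hypothetical per-connection totals, not actual Web100 variable names, and the classification rule (pick the largest) is an assumption rather than anything Web100 itself prescribes.

```python
def dominant_bottleneck(receiver_s, sender_s, congestion_s):
    """Name the state in which the connection spent the most time.

    The three arguments are hypothetical per-connection totals (seconds
    spent limited by the receiver window, by the sending application, and
    by the congestion window); they are not real Web100 variable names.
    """
    limits = {
        "receiver": receiver_s,  # out of receiver window
        "sender": sender_s,      # no data to send
        "path": congestion_s,    # out of congestion window
    }
    total = sum(limits.values())
    if total <= 0:
        return None, 0.0
    name = max(limits, key=limits.get)
    return name, limits[name] / total


# A transfer that spent most of its time waiting on the receiver window,
# as in the untuned-window case from the earlier example.
print(dominant_bottleneck(receiver_s=32.0, sender_s=3.0, congestion_s=5.0))
```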
Page 68: Finding Network Problems that Influence Applications ...

Reference Servers (Beacons)

H.323 conferencing
•Goal: portable machines that tell you if system likely to work (and if not, why?)
•Moderate-rate UDP of interest
•E.g., H.323 Beacon http://www.osc.edu/oarnet/itecohio.net/beacon/
•ViDeNet Scout, http://scout.video.unc.edu/

Presenter
Presentation Notes
Rather than the generic NDT tool, there are also specific tools for videoconferencing. Moderate-rate UDP is a substitute, but the H.323 Beacon from Ohio State (free) and ViDeNet Scout (uses licensed software) actually run the protocol and capture behavior. (NOTE: Do not read any of the items on this page – flip through quickly)
Page 69: Finding Network Problems that Influence Applications ...

Background: Tools Outline

Tools: First mile, host issues
Tools: Path issues
Tools: Others to be aware of
Tools within Abilene

Presenter
Presentation Notes
OK, what about tools to help us resolve path problems?
Page 70: Finding Network Problems that Influence Applications ...

OWAMP – Latency/Loss

One-Way Active Measurement Protocol
Requires NTP-Synchronized clocks
Look for one-way latency, loss
Authentication and Scheduling
Again, lots more later today

Presenter
Presentation Notes
Here’s a brief description of OWAMP just to get you oriented. More on this tool later today. (NOTE: Do not read any of the items on this page – flip through quickly)
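
To see why synchronized clocks matter, here is a rough sketch of how one-way delay and loss fall out of per-packet timestamps. It is not OWAMP code: the data layout and the function name are assumptions made for illustration.

```python
def one_way_stats(sent, received):
    """Summarize one-way delay and loss.

    sent:     {sequence number: send time in seconds, sender's clock}
    received: {sequence number: arrival time in seconds, receiver's clock}
    Both clocks are assumed to be synchronized (e.g. by NTP); any clock
    offset shows up directly as error in the delays.
    """
    delays = sorted(received[seq] - sent[seq] for seq in sent if seq in received)
    lost = len(sent) - len(delays)
    return {
        "loss_fraction": lost / len(sent) if sent else 0.0,
        "min_delay_s": delays[0] if delays else None,
        # Upper median when the number of delays is even.
        "median_delay_s": delays[len(delays) // 2] if delays else None,
    }


sent = {0: 0.000, 1: 0.100, 2: 0.200, 3: 0.300}
received = {0: 0.025, 1: 0.126, 3: 0.330}  # packet 2 never arrived
stats = one_way_stats(sent, received)
print(stats["loss_fraction"])  # 1 of 4 packets lost
```

Because each direction is measured separately, an asymmetric problem (loss or queuing in one direction only) is visible in a way that a round-trip ping would hide.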
Page 71: Finding Network Problems that Influence Applications ...

BWCTL -- Throughput

A tool for throughput testing that includes scheduling and authentication.
Currently uses iperf for actual tests.
Can assign users (or IP addresses) to classes, give classes different throughput limits or time limits.
Periodic and on-demand testing.
Lots more later today.

Presenter
Presentation Notes
Likewise, a brief description of BWCTL. More on this tool later today, also. (NOTE: Do not read any of the items on this page – flip through quickly)
Page 72: Finding Network Problems that Influence Applications ...

Background: Tools Outline

Tools: First mile, host issues
Tools: Path issues
Tools: Others to be aware of
Tools within Abilene

Presenter
Presentation Notes
Finally, here are some pointers to other tools that are used. They are here more for reference than for detailed explanation now.
Page 73: Finding Network Problems that Influence Applications ...

Some Commercial Tools

Caveat: only a partial list, give me more!
Spirent (nee Netcom/Adtech):
•SmartBits: test at low & high rates, QoS; test components or end-to-end path

NetIQ: Chariot/Pegasus
Agilent (like SmartBits, and FireHunter)
Ixia (like SmartBits/Spirent)
Brix Networks (like AMP/Owamp, for ‘QoS’)
Apparent Networks: path debugger

Presenter
Presentation Notes
Here are some commercial tools that we know of. (NOTE: show this slide but don’t read through the items) Spirent makes testers that can rigorously evaluate routers (and paths), and work at line rate. NetIQ has little drones that you can run with a command and control console. It can simulate some application behavior, and also has a capture then replay ability. Agilent makes testers like Spirent, and also has a product called FireHunter that is used by ISPs. It does things like pings, and FTP fetches, and Web fetches, and can issue alerts when things go out of spec. Ixia makes boxes like Agilent and Spirent. Brix Networks is interesting, because they make measurement points that you can deploy, and then run tests like OWAMP among them, as well as other tests specifically designed to probe QoS parameters and limits. We already mentioned Apparent Networks.
Page 74: Finding Network Problems that Influence Applications ...

Some Noncommercial Tools

Iperf: dast.nlanr.net/Projects/iperf
•See also http://www-itg.lbl.gov/nettest/
•http://www-didc.lbl.gov/NCS/

Flowscan:
•http://www.caida.org/tools/utilities/flowscan/
•http://net.doit.wisc.edu/~plonka/FlowScan/

SLAC’s traceroute perl script:
•http://www.slac.stanford.edu/comp/net/wan-mon/traceroute-srv.html

One large list:
•http://www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html

Presenter
Presentation Notes
Here are a couple of noncommercial tools, along with a pointer to a whole bunch more. (NOTE: show this slide but don’t read through the items) The direct pointer to iperf, which BWCTL uses. Flowscan is a tool to process netflow output and create pretty aggregate graphs. There is another set of tools, called “flow-tools” from Ohio State, that Abilene uses (note: Ohio State is an Internet2 technology evaluation center). SLAC has a perl script that can be used with a web server to provide traceroutes. Les Cottrell and his group at SLAC also have a huge list of tools.
Page 75: Finding Network Problems that Influence Applications ...

Background: Tools Outline

Tools: First mile, host issues
Tools: Path issues
Tools: Others to be aware of
Tools within Abilene

Presenter
Presentation Notes
OK, let’s see what’s in Abilene.
Page 76: Finding Network Problems that Influence Applications ...

Abilene: Measurements from the Center

Active (latency, throughput)
•Measurement within Abilene
•Measurements to the edge

Passive
•SNMP stats (esp. core Abilene links)
•Variables via router proxy
•Router configuration
•Route state
•Characterization of traffic
–Netflow; OCxMON

Presenter
Presentation Notes
Abilene does both active tests, and passive tests. Of particular interest is the router proxy. You can give mediated commands to the router to query state. This can be very useful. You can also issue traceroutes and pings from the Abilene routers.
Page 77: Finding Network Problems that Influence Applications ...

Goal

Abilene goal to be an exemplar
•Measurements open
•Tests possible to router nodes
•Throughput tests routinely through backbone
•…as well as existing utilization, etc.
•The “Abilene Observatory” http://abilene.internet2.edu/observatory

Presenter
Presentation Notes
Why does Abilene take all these measurements, and publish the results? (NOTE: Do not read all of the items on this page – flip through quickly)
Page 78: Finding Network Problems that Influence Applications ...

Abilene: Machines

GigE connected high-performance tester•bwctl, “nms1”, 9000 byte MTU

Latency tester•owamp, “nms4”, 100bT

Stats collection•SNMP, flow-stats, “nms3”, 100bT

Ad-hoc tests•NDT server, “nms2”, gigE, 1500 byte MTU

Presenter
Presentation Notes
We currently have four machines at each router node. Here are their roles. (NOTE: show this slide but don’t read through the items) Add slide probably: 1.4 GHz PIII, dual bank. Whatever the chipset. Fairly slow, but fastest we could get with a 48VDC supply off the shelf when we were building Abilene.
Page 79: Finding Network Problems that Influence Applications ...

Throughput

Take tests 1/hr, 20 seconds each
•IPv4 TCP
•IPv6 TCP (no discernible difference)
•IPv4 UDP (on our platforms flaky at 1G)
•IPv6 UDP (ditto)

Others test to our nodes
Others test amongst themselves
Net result: 25% of traffic (NOT capacity) is measurement

Presenter
Presentation Notes
Abilene uses BWCTL for throughput. (NOTE: show this slide but don’t read through the items)
Page 80: Finding Network Problems that Influence Applications ...

Latency

CDMA used to synchronize NTP
•www.endruntechnologies.com

Test among all router node pairs
10/sec
IPv4 and IPv6
Minimal sized packets
Poisson schedule

Presenter
Presentation Notes
Abilene uses OWAMP for latency. (NOTE: show this slide but don’t read through the items)
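
A Poisson schedule at an average of 10 packets per second can be sketched by drawing exponentially distributed gaps between packets. This is an illustration of the idea, not the scheduler OWAMP itself uses; the function name and the fixed seed are assumptions.

```python
import random


def poisson_send_times(rate_per_s, duration_s, seed=None):
    """Return send times (seconds) for a Poisson probe stream.

    Gaps between packets are drawn from an exponential distribution with
    mean 1 / rate_per_s, which is what makes the arrival process Poisson.
    """
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate_per_s)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate_per_s)
    return times


times = poisson_send_times(rate_per_s=10, duration_s=60, seed=1)
print(len(times) / 60)  # average rate, close to 10 packets/sec
```

Random spacing keeps the probes from locking in step with periodic behavior in the network, which evenly spaced probes might consistently hit or consistently miss.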
Page 81: Finding Network Problems that Influence Applications ...

Passive - Utilization

The Abilene NOC takes
•Packets in, out
•Bytes in, out
•Drops/Errors
•..for all interfaces, publishes internal links & peering points (at 5 min intervals)
•..via SNMP polling – every 60 sec
http://loadrunner.uits.iu.edu/weathermaps/abilene/abilene.html

Presenter
Presentation Notes
A word on Abilene’s utilization. (NOTE: Do not read any of the items on this page – flip through quickly)
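
Turning two successive counter samples into utilization and a drop rate might look like the sketch below. It assumes plain integer counter values that have already been fetched (no SNMP library is involved), 32-bit counters by default, and at most one wrap between polls; none of this is taken from the Abilene NOC's actual tooling.

```python
def counter_delta(old, new, bits=32):
    """Difference between two SNMP counter samples, allowing one wrap."""
    return (new - old) % (1 << bits)


def utilization(old_octets, new_octets, interval_s, link_bps, bits=32):
    """Fraction of link capacity used between two byte-counter samples."""
    return counter_delta(old_octets, new_octets, bits) * 8 / (interval_s * link_bps)


def drop_fraction(old_pkts, new_pkts, old_drops, new_drops, bits=32):
    """Dropped packets as a fraction of packets seen in the interval.

    Assumes the packet counter does not already include the drops.
    """
    pkts = counter_delta(old_pkts, new_pkts, bits)
    drops = counter_delta(old_drops, new_drops, bits)
    return drops / (pkts + drops) if pkts + drops else 0.0


# 60-second poll on a 100 Mbps link; the byte counter wrapped once.
print(utilization(4_000_000_000, 80_032_704, 60, 100e6))
print(drop_fraction(1000, 1990, 5, 15))
```

A 32-bit byte counter can wrap in roughly 34 seconds at a sustained 1 Gbps, so with 60-second polling on gigabit links 64-bit counters are the safer choice.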
Page 82: Finding Network Problems that Influence Applications ...

Presenter
Presentation Notes
And you’ve all seen the weathermap. This is an older version (no Chicago!)
Page 83: Finding Network Problems that Influence Applications ...

Abilene Pointers

http://www.abilene.iu.edu/
•Monitoring
•Tools

http://www.itec.oar.net/abilene-netflow
http://netflow.internet2.edu/weekly/ (summaries)

Presenter
Presentation Notes
There are lots of tools at the NOC page. Netflow data is currently at Ohio State. And we make weekly summaries to try and understand what traffic is passing over the network, and watch for trends.