Troubleshooting Network Performance Issues with Active Monitoring
NANOG 64
Brian Tierney, ESnet, bltierney@es.net
June 2, 2015
This document is a result of work by the perfSONAR Project (http://www.perfsonar.net) and is licensed under CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/).
Who Am I? • My Background – Lawrence Berkeley National Lab – ESnet – Big Science Data
Lawrence Berkeley National Laboratory
Energy Sciences Network
• Connects Department of Energy National Laboratories to universities and research institutions around the world • Many sites with 100G connections to ESnet today - Berkeley, Livermore, Stanford, Fermi, Brookhaven, Oak Ridge, Argonne
ESnet / DOE National Lab Network Profile
• Small-ish numbers of very large flows over very long distances: – Between California, Illinois, New York, Tennessee, Switzerland
• High-speed “Access” links - 100G sites connected to 100G core • Nx10G hosts, future Nx40G hosts, dedicated to Data Transfer • GridFTP / Globus Online / Parallel FTP • LHC detectors to data centers around the world (future 180Gbps) • Electron microscopes to supercomputers (20k – 100k FPS per camera)
Introduction & Motivation
perfSONAR Goals • Originally developed to help troubleshoot large science data transfers – E.g.: LHC at CERN; large genomic data sets, etc.
• Useful for any network doing large file transfers. E.g.: – Into Amazon EC2 – Large HD movies (Hollywood post production) – Many new “big data” applications
What is perfSONAR? • perfSONAR is a tool to:
– Set (hopefully raise) network performance expectations – Find network problems (“soft failures”) – Help fix these problems
• All in multi-domain environments • These problems are all harder when multiple networks are involved
• perfSONAR provides a standard way to publish active and passive monitoring data – This data is interesting to network researchers as well as network operators
Problem Statement • In practice, performance issues are prevalent and distributed. • When a network is underperforming or errors occur, it is difficult to identify the source, as problems can happen anywhere, in any domain. • Local-area network testing is not sufficient, as errors can occur between networks.
Where Are The Problems?
[Diagram: source campus → regional → backbone / NREN → regional → destination campus. Trouble spots: congested or faulty links between domains; congested intra-campus links; latency-dependent problems inside domains with small RTT]
Local Testing Will Not Find Everything
[Diagram: source campus → regional → R&E backbone → regional → destination campus, with a small-buffered switch near the destination. Performance is good when RTT is < ~10 ms and poor when RTT exceeds ~10 ms]
A small amount of packet loss makes a huge difference in TCP performance
[Chart: TCP throughput vs. distance (Local/LAN, Metro Area, Regional, Continental, International) for measured TCP Reno, measured HTCP, theoretical TCP Reno, and measured no-loss cases]
With loss, high performance beyond metro distances is essentially impossible
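The loss-vs-distance behavior above is captured by the classic Mathis et al. model for loss-limited TCP Reno. A minimal sketch (the MSS and the 0.01% loss rate below are illustrative assumptions, not the measured values behind the chart):

```python
import math

def mathis_throughput_mbps(mss_bytes, rtt_ms, loss_rate):
    """Mathis et al. approximation for loss-limited TCP Reno:
    rate <= (MSS / RTT) * (C / sqrt(p)), with C = sqrt(3/2)."""
    rtt_s = rtt_ms / 1000.0
    c = math.sqrt(3.0 / 2.0)
    bits_per_sec = (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))
    return bits_per_sec / 1e6

# Same tiny loss rate, increasing RTT: the achievable rate collapses
# as the path stretches from LAN to international distances.
for rtt_ms in (1, 10, 50, 100):
    rate = mathis_throughput_mbps(1460, rtt_ms, 0.0001)  # 0.01% loss
    print(f"{rtt_ms:>4} ms RTT -> {rate:8.1f} Mbps ceiling")
```

Note the 1/RTT dependence: the same loss rate that is harmless on a LAN caps a continental path at a few tens of Mbps.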
Soft Network Failures • Soft failures are where basic connectivity functions, but high performance is not possible.
• TCP was intentionally designed to hide all transmission errors from the user: – “As long as the TCPs continue to function properly and the internet system does not become completely partitioned, no transmission errors will affect the users.” (From IEN 129, RFC 716)
• Some soft failures only affect high bandwidth, long RTT flows.
• Hard failures are easy to detect & fix – soft failures can lie hidden for years!
• One network problem can often mask others • Hardware counters can lie!
Hard vs. Soft Failures • “Hard failures” are the kind of problems every organization understands – Fiber cut – Power failure takes down routers – Hardware ceases to function
• Classic monitoring systems are good at alerting on hard failures – i.e., NOC sees something turn red on their screen – Engineers paged by monitoring systems
• “Soft failures” are different and often go undetected – Basic connectivity (ping, traceroute, web pages, email) works – Performance is just poor
• How much should we care about soft failures?
Causes of Packet Loss • Network Congestion
– Easy to confirm via SNMP, easy to fix with $$ – This is not a ‘soft failure’, but just a network capacity issue – Often people assume congestion is the issue when in fact it is not.
• Under-buffered switch dropping packets – Hard to confirm
• Under-powered firewall dropping packets – Hard to confirm
• Dirty fibers or connectors, failing optics/light levels – Sometimes easy to confirm by looking at error counters in the routers
• Overloaded or slow receive host dropping packets – Easy to confirm by looking at CPU load on the host
Sample Soft Failure: Failing Optics
[Chart: throughput in Gb/s over time, showing normal performance, roughly one month of degrading performance, then recovery after repair]
Sample Soft Failure: Under-powered Firewall
Inside the firewall: • One direction severely impacted by firewall • Not useful for science data
Outside the firewall: • Good performance in both directions
Sample Soft Failure: Host Tuning
Soft Failure: Under-buffered Switches
• Average TCP results, various switches • Buffers per 10G egress port, 2x parallel TCP streams, 50 ms simulated RTT, 2 Gbps UDP background traffic
1 MB – Brocade MLXe
9 MB – Arista 7150
16 MB – Cisco 6704
64 MB – Brocade MLXe
90 MB – Cisco 6716
VOQ – Arista 7504
200 MB – Cisco 6716
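A back-of-the-envelope check on the buffer sizes above, assuming the classic rule of thumb that an egress buffer should hold roughly one bandwidth-delay product when only a few TCP flows share the port:

```python
# Assumption: one-BDP buffer rule of thumb for a single (or few) TCP
# flows. Test conditions from the slide: 10G egress port, 50 ms RTT.
link_bps = 10e9
rtt_s = 0.050

bdp_mb = link_bps * rtt_s / 8 / 1e6  # bytes of in-flight data, in MB
print(f"BDP = {bdp_mb:.1f} MB")

# Switches with >= ~64 MB per port (or VOQ designs) can absorb a full
# window burst; the 1-16 MB switches in the list above cannot.
```

This matches the pattern in the test results: the ~64 MB and larger (or VOQ) switches handle the 50 ms path well, while the small-buffer switches drop packets.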
What is perfSONAR? • perfSONAR is a tool to:
• Set network performance expectations • Find network problems (“soft failures”) • Help fix these problems • All in multi-domain environments
• These problems are all harder when multiple networks are involved • perfSONAR provides a standard way to publish active and passive monitoring data • perfSONAR = Measurement Middleware
– You can’t fix what you can’t find – You can’t find what you can’t see – perfSONAR lets you see
perfSONAR History • perfSONAR can trace its origin to the Internet2 “End 2 End performance Initiative” from the year 2000. • What has changed since 2000?
– The Good News: • TCP is much less fragile; Cubic is the default CC algorithm, and autotuning and larger TCP buffers are everywhere • Reliable parallel transfers via tools like Globus Online • High-performance UDP-based commercial tools like Aspera • More good news in the latest Linux kernel, but it will take 3-4 years before this is widely deployed
– The Bad News: • The wizard gap is still large • Jumbo frame use is still small • Under-buffered switches and routers are still common • Under-powered/misconfigured firewalls are common • Soft failures still go undetected for months • User performance expectations are still too low
The perfSONAR collaboration • The perfSONAR collaboration is an open source project led by ESnet, Internet2, Indiana University, and GEANT. – Each organization has committed 1.5 FTE effort to the project – Plus additional help from many others in the community (OSG, RNP, SLAC, and more)
• The perfSONAR roadmap is influenced by – requests on the project issue tracker – annual user surveys sent to everyone on the user list – regular meetings with VOs using perfSONAR, such as the WLCG and OSG – discussions at various perfSONAR-related workshops
• Based on the above, every 6-12 months the perfSONAR governance group meets to prioritize features based on: – impact to the community – level of effort required to implement and support – availability of someone with the right skill set for the task
perfSONAR Toolkit • The “perfSONAR Toolkit” is an open source implementation and packaging of the perfSONAR measurement infrastructure and protocols – http://www.perfsonar.net
• All components are available as RPMs, and bundled into a CentOS 6-based “netinstall” (Debian support in the next release)
• perfSONAR tools are much more accurate if run on a dedicated perfSONAR host, not on the data server
• Easy to install and configure • Usually takes less than 30 minutes
perfSONAR Toolkit Components • BWCTL for scheduling periodic throughput (iperf, iperf3), ping and traceroute tests • OWAMP for measuring one-way latency and packet loss (RFC 4656) • “esmond” for storing measurements, summarizing results, and making them available via a REST API • Lookup service for discovering other testers • On-demand test tools such as NDT (Network Diagnostic Test) • Configuration GUIs to assist with managing all of the above • Graphs to display measurement results
Who is running perfSONAR? • Currently over 1400 deployments world-wide
http://stats.es.net/ServicesDirectory/
perfSONAR Deployment Growth
Active vs. Passive Monitoring
Passive monitoring systems have limitations • Performance problems are often only visible at the ends • Individual network components (e.g. routers) have no knowledge of end-to-end state • perfSONAR tests the network in ways that passive monitoring systems do not
More perfSONAR hosts = better network visibility • There are now enough perfSONAR hosts in the world to be quite useful
perfSONAR Dashboard: Raising Expectations and Improving Network Visibility
Status at-a-glance: • Packet loss • Throughput • Correctness
Current live instances at: • http://ps-dashboard.es.net/ • And many more
Drill-down capabilities: • Test history between hosts • Ability to correlate with other events • Very valuable for fault localization and isolation
perfSONAR Hardware • These days you can get a good 1U host capable of pushing 10Gbps TCP for around $500 (+10G NIC cost). – See perfSONAR user list
• And you can get a host capable of 1G for around $100! – Intel Celeron-based (ARM is not fast enough) – e.g.: http://www.newegg.com/Product/Product.aspx?Item=N82E16856501007
• VMs are not recommended – Tools work better if they can guarantee NIC isolation
Active and Growing perfSONAR Community
• Active email lists and forums provide: – Instant access to advice and expertise from the community. – Ability to share metrics, experience, and findings with others to help debug issues on a global scale.
• Joining the community automatically increases the reach and power of perfSONAR – More endpoints means exponentially more ways to test and discover issues, and to compare metrics
perfSONAR Community
• The perfSONAR collaboration is working to build a strong user community to support the use and development of the software.
• perfSONAR Mailing Lists – Announcement List: • https://mail.internet2.edu/wws/subrequest/perfsonar-announce – Users List: • https://mail.internet2.edu/wws/subrequest/perfsonar-users
Use Cases & Success Stories
Success Stories - Failing Optic(s)
[Chart: throughput in Gb/s over time, showing normal performance, roughly one month of degrading performance, then recovery after repair]
Success Stories - Failing Optic(s) • Adding an attenuator to a noisy link (discovered via OWAMP)
Success Stories – Firewall trouble
Success Stories – Firewall trouble
• Results to host behind the firewall:
Success Stories – Firewall trouble
• In front of the firewall:
Success Stories – Firewall trouble • Want more proof? Let's look at a measurement tool through the firewall.
– Measurement tools emulate a well-behaved application
• ‘Outbound’, not filtered:
nuttcp -T 10 -i 1 -p 10200 bwctl.newy.net.internet2.edu
   92.3750 MB / 1.00 sec = 774.3069 Mbps 0 retrans
  111.8750 MB / 1.00 sec = 938.2879 Mbps 0 retrans
  111.8750 MB / 1.00 sec = 938.3019 Mbps 0 retrans
  111.7500 MB / 1.00 sec = 938.1606 Mbps 0 retrans
  111.8750 MB / 1.00 sec = 938.3198 Mbps 0 retrans
  111.8750 MB / 1.00 sec = 938.2653 Mbps 0 retrans
  111.8750 MB / 1.00 sec = 938.1931 Mbps 0 retrans
  111.9375 MB / 1.00 sec = 938.4808 Mbps 0 retrans
  111.6875 MB / 1.00 sec = 937.6941 Mbps 0 retrans
  111.8750 MB / 1.00 sec = 938.3610 Mbps 0 retrans

 1107.9867 MB / 10.13 sec = 917.2914 Mbps 13 %TX 11 %RX 0 retrans 8.38 msRTT
Success Stories – Firewall trouble
• ‘Inbound’, filtered:
nuttcp -r -T 10 -i 1 -p 10200 bwctl.newy.net.internet2.edu
    4.5625 MB / 1.00 sec = 38.1995 Mbps 13 retrans
    4.8750 MB / 1.00 sec = 40.8956 Mbps  4 retrans
    4.8750 MB / 1.00 sec = 40.8954 Mbps  6 retrans
    6.4375 MB / 1.00 sec = 54.0024 Mbps  9 retrans
    5.7500 MB / 1.00 sec = 48.2310 Mbps  8 retrans
    5.8750 MB / 1.00 sec = 49.2880 Mbps  5 retrans
    6.3125 MB / 1.00 sec = 52.9006 Mbps  3 retrans
    5.3125 MB / 1.00 sec = 44.5653 Mbps  7 retrans
    4.3125 MB / 1.00 sec = 36.2108 Mbps  7 retrans
    5.1875 MB / 1.00 sec = 43.5186 Mbps  8 retrans

   53.7519 MB / 10.07 sec = 44.7577 Mbps 0 %TX 1 %RX 70 retrans 8.29 msRTT
Success Stories - Host Tuning • Long path (~70 ms), single stream TCP, 10G cards, tuned hosts • Why the nearly 2x uptick? Adjusted net.ipv4.tcp_rmem/wmem maximums (used in auto tuning) to 64M instead of 16M.
Success Stories - Fiber Cut
• perfSONAR can't fix fiber cuts, but you can see the loss event and the latency change due to traffic re-routing
Success Stories - Monitoring TA Links
Success Stories - MTU Changes
Success Stories - Host NIC Speed Mismatch
• Sending from a 10G host to a 1G host leads to unstable performance
http://fasterdata.es.net/performance-testing/troubleshooting/interface-speed-mismatch/ http://fasterdata.es.net/performance-testing/evaluating-network-performance/impedence-mismatch/
Another Failing Line Card Example
Success Stories Summary • Key message(s): • This type of active monitoring finds problems in the network and on the end hosts. • The ability to view long-term trends in latency, loss, and throughput is very useful – Ability to show someone a plot, and say “What did you do on Day X? It broke something.” – Ability to show the impact of an upgrade (or lack thereof).
Deployment & Advanced Regular Testing Strategies
Importance of Regular Testing
• We can't wait for users to report problems and then fix them (soft failures can go unreported for years!)
• Things just break sometimes – Failing optics – Somebody messed around in a patch panel and kinked a fiber – Hardware goes bad
• Problems that get fixed have a way of coming back – System defaults come back after hardware/software upgrades – New employees may not know why the previous employee set things up a certain way, and back out fixes
• Important to continually collect, archive, and alert on active throughput test results
A perfSONAR Dashboard
http://ps-dashboard.es.net (google “perfSONAR MaDDash” for other public dashboards)
perfSONAR Deployment Locations
• perfSONAR hosts are most useful if they are deployed near key resources such as data servers
• More perfSONAR hosts allow segments of the path to be tested separately – Reduced visibility for devices between perfSONAR hosts – Must rely on counters or other means where perfSONAR can't go
Example perfSONAR Host Placement
Regular perfSONAR Tests • We run regular tests to check for three things: – TCP throughput – One-way packet loss and delay – traceroute
• perfSONAR has mechanisms for managing regular testing between perfSONAR hosts – Statistics collection and archiving – Graphs – Dashboard display – Integration with NAGIOS
• Many public perfSONAR hosts around the world that you may be able to test against
Regular Testing • Options: – Beacon: Let others test to you • No regular testing configuration is needed
– Island: Pick some hosts to test to – you store the data locally • No coordination with others is needed
– Mesh: Full coordination between you and others • Use a testing configuration that includes tests to everyone, and incorporate into a dashboard
Regular Testing - Beacon
• The beacon setup is typically employed by a network provider (regional, backbone, exchange point) – A service to the users (allows people to test into the network) – Can be configured with Layer 2 connectivity if needed – If no regular tests are scheduled, minimal requirements for local storage – Makes the most sense to enable all services (bandwidth and latency)
Regular Testing - Island
• The island setup allows a site to test against any number of the perfSONAR nodes around the world, and store the data locally. – No coordination required with other sites – Allows a view of near-horizon testing (e.g. short latency – campus, regional) and far-horizon testing (backbone network, remote collaborators). – OWAMP is particularly useful for determining packet loss in these cases.
Regular Testing - Mesh
• A full mesh requires more coordination: – A full mesh means all hosts involved are running the same test configuration – A partial mesh could mean only a small number of related hosts are running a testing configuration
• In either case, bandwidth and latency will be valuable test cases
Develop a Test Plan
• What are you going to measure? – Achievable bandwidth • 2-3 regional destinations • 4-8 important collaborators • 4-6 times per day to each destination • 10-20 second tests within a region, maybe longer across oceans and continents
– Loss/Availability/Latency • OWAMP: ~10-20 collaborators over diverse paths
– Interface Utilization & Errors (via SNMP)
• What are you going to do with the results? – NAGIOS Alerts – Reports to user community? – Internal and/or external Dashboard
http://stats.es.net/ServicesDirectory/
Common Questions • Q: Do most perfSONAR hosts accept tests from anyone? – A: Not most, but some do. ESnet allows owamp and iperf TCP tests from all R&E address space, but not the commercial internet.
• Q: Will perfSONAR test traffic step on my production traffic? – A: TCP is designed to be friendly; ESnet tags all our internal tests as ‘scavenger’ to ensure test traffic is dropped first. But too much testing can impact production traffic, so be careful of this.
• Q: How can I control who can run tests to my host? – A: Router ACLs, host ACLs, bwctld.limits – Future versions of bwctl will include even more options to limit testing • e.g. only allow iperf between 12am and 6am • only allow 2 tests per day from all but the following subnets
Use of Measurement Tools
Tool Usage • All of the previous examples were discovered, debugged, and corrected with the aid of the tools in the perfSONAR toolkit
• perfSONAR-specific tools: bwctl, owamp
• Standard Unix tools: ping, traceroute, tracepath
• Standard Unix add-on tools: iperf, iperf3, nuttcp
Default perfSONAR Throughput Tool: iperf3
• iperf3 (http://software.es.net/iperf/) is a new implementation of iperf from scratch, with the goal of a smaller, simpler code base, and a library version of the functionality that can be used in other programs.
• Some new features in iperf3 include: – reports the number of TCP packets that were retransmitted and CWND – reports the average CPU utilization of the client and server (-V flag) – support for zero copy TCP (-Z flag) – JSON output format (-J flag) – “omit” flag: ignore the first N seconds in the results
• On RHEL-based hosts, just type ‘yum install iperf3’ • More at: http://fasterdata.es.net/performance-testing/network-troubleshooting-tools/iperf-and-iperf3/
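The -J flag makes iperf3 results easy to post-process. A sketch of pulling the end-of-test summary out of the JSON, using a hand-trimmed sample document (real -J output carries many more fields; the numbers here are illustrative):

```python
import json

# Hand-written, trimmed sample of `iperf3 -J` output. Only the fields
# read below are included; the values are made up for illustration.
sample = """
{"end": {"sum_sent":     {"bits_per_second": 2.62e9, "retransmits": 2},
         "sum_received": {"bits_per_second": 2.63e9}}}
"""

result = json.loads(sample)
sent = result["end"]["sum_sent"]
print(f"{sent['bits_per_second'] / 1e9:.2f} Gbit/s sent, "
      f"{sent['retransmits']} retransmits")
# prints: 2.62 Gbit/s sent, 2 retransmits
```

In practice you would feed the output of `iperf3 -J -c HOST` into `json.loads` and archive or alert on the summary numbers.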
Sample iperf3 output on lossy network • Performance is < 1 Gbps due to heavy packet loss:
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[ 15]   0.00-1.00   sec   139 MBytes  1.16 Gbits/sec  257   33.9 KBytes
[ 15]   1.00-2.00   sec   106 MBytes   891 Mbits/sec  138   26.9 KBytes
[ 15]   2.00-3.00   sec   105 MBytes   881 Mbits/sec  132   26.9 KBytes
[ 15]   3.00-4.00   sec  71.2 MBytes   598 Mbits/sec  161   15.6 KBytes
[ 15]   4.00-5.00   sec   110 MBytes   923 Mbits/sec  123   43.8 KBytes
[ 15]   5.00-6.00   sec   136 MBytes  1.14 Gbits/sec  122   58.0 KBytes
[ 15]   6.00-7.00   sec  88.8 MBytes   744 Mbits/sec  140   31.1 KBytes
[ 15]   7.00-8.00   sec   112 MBytes   944 Mbits/sec  143   45.2 KBytes
[ 15]   8.00-9.00   sec   119 MBytes   996 Mbits/sec  131   32.5 KBytes
[ 15]   9.00-10.00  sec   110 MBytes   923 Mbits/sec  182   46.7 KBytes
BWCTL • BWCTL is the wrapper around all the perfSONAR tools • Policy specification can do things like prevent tests to some subnets, or allow longer tests to others. See the man pages for more details.
• Some general notes: – Use ‘-c’ to specify a ‘catcher’ (receiver) – Use ‘-s’ to specify a ‘sender’ – Will default to IPv6 if available (use -4 to force IPv4 as needed, or specify things in terms of an address if your host names are dual homed)
bwctl features • BWCTL lets you run any of the following between any 2 perfSONAR nodes: – iperf3, nuttcp, ping, owping, traceroute, and tracepath
• Sample commands:
bwctl -c psmsu02.aglt2.org -s elpa-pt1.es.net -T iperf3
bwping -s atla-pt1.es.net -c ga-pt1.es.net
bwping -E -c www.google.com
bwtraceroute -T tracepath -c lbl-pt1.es.net -l 8192 -s atla-pt1.es.net
bwping -T owamp -s atla-pt1.es.net -c ga-pt1.es.net -N 1000 -i .01
Host Issues: Throughput is Dependent on TCP Buffers
• A prequel to using BWCTL throughput tests – the Bandwidth Delay Product – The amount of “in flight” data allowed for a TCP connection (BDP = bandwidth * round trip time)
– Example: 1 Gb/s cross country, ~100 ms • 1,000,000,000 b/s * .1 s = 100,000,000 bits • 100,000,000 / 8 = 12,500,000 bytes • 12,500,000 bytes / (1024*1024) ~ 12 MB
– Most OSs have a default TCP buffer size of 4 MB per flow • This is too small for long paths • More details at https://fasterdata.es.net/host-tuning/
• Host admins need to make sure there is enough TCP buffer space to keep the sender from waiting for acknowledgements from the receiver
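The arithmetic above, plus the flip side (a too-small buffer caps a single flow at roughly buffer/RTT), as a quick sketch:

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be 'in flight' to fill
    the path, so the minimum useful TCP buffer size."""
    return bandwidth_bps * rtt_s / 8

# 1 Gb/s cross-country path, ~100 ms RTT
bdp = bdp_bytes(1_000_000_000, 0.1)
print(f"BDP = {bdp / (1024 * 1024):.1f} MiB")         # ~11.9 MiB

# Conversely, a default 4 MB buffer caps a single flow on this path:
cap_mbps = 4 * 1024 * 1024 * 8 / 0.1 / 1e6
print(f"4 MB buffer => at most {cap_mbps:.0f} Mbps")  # ~336 Mbps
```

This is why a host with default buffers can saturate a LAN yet only manage a few hundred Mbps coast-to-coast on the same 1G NIC.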
Network Throughput • Start with a definition:
– Network throughput is the rate of successful message delivery over a communication channel – In easier terms: how much data can I shovel into the network in some given amount of time
• What does this tell us? – The opposite of utilization (i.e. it's how much we can get at a given point in time, minus what is utilized) – Utilization and throughput added together are capacity
• Tools that measure throughput are a simulation of a real world use case (e.g. how well might bulk data movement perform)
• Ways to game the system – Parallel streams – Manual window size adjustments – ‘Memory to memory’ testing – no disk involved
Throughput Test Tools • Varieties of throughput tools that BWCTL knows how to manage:
– iperf (v2) • Default for the command line (e.g. bwctl -c HOST will invoke this) • Some known behavioral problems (CPU hog, hard to get UDP testing to be correct)
– iperf3 • Default for the perfSONAR regular testing framework; can invoke via a command line switch (bwctl -T iperf3 -c HOST)
– nuttcp • Different code base; can invoke via a command line switch (bwctl -T nuttcp -c HOST) • More control over how the tool behaves on the host (bind to CPU/core, etc.) • Similar feature set to iperf3
BWCTL Example (iperf)
> bwctl -T iperf -f m -t 10 -i 2 -c sunn-pt1.es.net
bwctl: 83 seconds until test results available
bwctl: exec_line: /usr/bin/iperf -B 198.129.254.58 -s -f m -m -p 5136 -t 10 -i 2.000000
bwctl: run_tool: tester: iperf
bwctl: run_tool: receiver: 198.129.254.58
bwctl: run_tool: sender: 198.124.238.34
bwctl: start_tool: 3598657357.738868
------------------------------------------------------------
Server listening on TCP port 5136
Binding to local address 198.129.254.58
TCP window size: 0.08 MByte (default)
------------------------------------------------------------
[ 16] local 198.129.254.58 port 5136 connected with 198.124.238.34 port 5136
[ ID] Interval       Transfer     Bandwidth
[ 16]  0.0- 2.0 sec  90.4 MBytes   379 Mbits/sec
[ 16]  2.0- 4.0 sec   689 MBytes  2891 Mbits/sec
[ 16]  4.0- 6.0 sec   684 MBytes  2867 Mbits/sec
[ 16]  6.0- 8.0 sec   691 MBytes  2897 Mbits/sec
[ 16]  8.0-10.0 sec   691 MBytes  2898 Mbits/sec
[ 16]  0.0-10.0 sec  2853 MBytes  2386 Mbits/sec
[ 16] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)

This is what perfSONAR graphs – the average of the complete test
BWCTL Example (iperf3)
> bwctl -T iperf3 -t 10 -i 2 -c sunn-pt1.es.net
bwctl: run_tool: tester: iperf3
bwctl: run_tool: receiver: 198.129.254.58
bwctl: run_tool: sender: 198.124.238.34
bwctl: start_tool: 3598657653.219168
Test initialized
Running client
Connecting to host 198.129.254.58, port 5001
[ 17] local 198.124.238.34 port 34277 connected to 198.129.254.58 port 5001
[ ID] Interval           Transfer     Bandwidth       Retransmits
[ 17]   0.00-2.00   sec   430 MBytes  1.80 Gbits/sec  2
[ 17]   2.00-4.00   sec   680 MBytes  2.85 Gbits/sec  0
[ 17]   4.00-6.00   sec   669 MBytes  2.80 Gbits/sec  0
[ 17]   6.00-8.00   sec   670 MBytes  2.81 Gbits/sec  0
[ 17]   8.00-10.00  sec   680 MBytes  2.85 Gbits/sec  0
[ ID] Interval           Transfer     Bandwidth       Retransmits
          Sent
[ 17]   0.00-10.00  sec  3.06 GBytes  2.62 Gbits/sec  2
          Received
[ 17]   0.00-10.00  sec  3.06 GBytes  2.63 Gbits/sec
iperf Done.
bwctl: stop_tool: 3598657664.995604

This is what perfSONAR graphs – the average of the complete test
BWCTL Example (nuttcp)
> bwctl -T nuttcp -f m -t 10 -i 2 -c sunn-pt1.es.net
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 198.129.254.58
nuttcp-t: connect to 198.129.254.58 with mss=8948, RTT=62.418 ms
nuttcp-t: send window size = 98720, receive window size = 87380
nuttcp-t: available send window = 74040, available receive window = 65535
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: send window size = 98720, receive window size = 87380
nuttcp-r: available send window = 74040, available receive window = 65535
  131.0625 MB / 2.00 sec =  549.7033 Mbps 1 retrans
  725.6250 MB / 2.00 sec = 3043.4964 Mbps 0 retrans
  715.0000 MB / 2.00 sec = 2998.8284 Mbps 0 retrans
  714.3750 MB / 2.00 sec = 2996.4168 Mbps 0 retrans
  707.1250 MB / 2.00 sec = 2965.8349 Mbps 0 retrans
nuttcp-t: 2998.1379 MB in 10.00 real seconds = 307005.08 KB/sec = 2514.9856 Mbps
nuttcp-t: 2998.1379 MB in 2.32 CPU seconds = 1325802.48 KB/cpu sec
nuttcp-t: retrans = 1
nuttcp-t: 47971 I/O calls, msec/call = 0.21, calls/sec = 4797.03
nuttcp-t: 0.0user 2.3sys 0:10real 23% 0i+0d 768maxrss 0+2pf 156+28csw

nuttcp-r: 2998.1379 MB in 10.07 real seconds = 304959.96 KB/sec = 2498.2320 Mbps
nuttcp-r: 2998.1379 MB in 2.36 CPU seconds = 1301084.31 KB/cpu sec
nuttcp-r: 57808 I/O calls, msec/call = 0.18, calls/sec = 5742.21
nuttcp-r: 0.0user 2.3sys 0:10real 23% 0i+0d 770maxrss 0+4pf 9146+24csw
BWCTL Example (nuttcp, ~1% loss)
> bwctl -T nuttcp -f m -t 10 -i 2 -c sunn-pt1.es.net
bwctl: exec_line: /usr/bin/nuttcp -vv -p 5004 -i 2.000000 -T 10 -t 198.129.254.58
nuttcp-t: buflen=65536, nstream=1, port=5004 tcp -> 198.129.254.58
nuttcp-t: connect to 198.129.254.58 with mss=8948, RTT=62.440 ms
nuttcp-t: send window size = 98720, receive window size = 87380
nuttcp-t: available send window = 74040, available receive window = 65535
nuttcp-r: send window size = 98720, receive window size = 87380
nuttcp-r: available send window = 74040, available receive window = 65535
    6.3125 MB / 2.00 sec = 26.4759 Mbps 27 retrans
    3.5625 MB / 2.00 sec = 14.9423 Mbps  4 retrans
    3.8125 MB / 2.00 sec = 15.9906 Mbps  7 retrans
    4.8125 MB / 2.00 sec = 20.1853 Mbps 13 retrans
    6.0000 MB / 2.00 sec = 25.1659 Mbps  7 retrans
nuttcp-t: 25.5066 MB in 10.00 real seconds = 2611.85 KB/sec = 21.3963 Mbps
nuttcp-t: 25.5066 MB in 0.01 CPU seconds = 1741480.37 KB/cpu sec
nuttcp-t: retrans = 58
nuttcp-t: 409 I/O calls, msec/call = 25.04, calls/sec = 40.90
nuttcp-t: 0.0user 0.0sys 0:10real 0% 0i+0d 768maxrss 0+2pf 51+3csw

nuttcp-r: 25.5066 MB in 10.30 real seconds = 2537.03 KB/sec = 20.7833 Mbps
nuttcp-r: 25.5066 MB in 0.02 CPU seconds = 1044874.29 KB/cpu sec
nuttcp-r: 787 I/O calls, msec/call = 13.40, calls/sec = 76.44
nuttcp-r: 0.0user 0.0sys 0:10real 0% 0i+0d 770maxrss 0+4pf 382+0csw
Throughput Expectations
Q: What iperf throughput should you expect to see on an uncongested 10 Gbps network?
A: 3 - 9.9 Gbps, depending on – RTT – TCP tuning – CPU core speed, and the ratio of sender speed to receiver speed
OWAMP • OWAMP = One Way Active Measurement Protocol – i.e. ‘one way ping’
• Some differences from traditional ping: – Measures each direction independently (recall that we often see things like congestion occur in one direction and not the other) – Uses small, evenly spaced groupings of UDP (not ICMP) packets – Ability to ramp up the interval of the stream, size of the packets, number of packets
• OWAMP is most useful for detecting packet train abnormalities on an end-to-end basis – Loss – Duplication – Out of order packets – Latency on the forward vs. reverse path – Number of Layer 3 hops
• Does require accurate time via NTP – the perfSONAR toolkit takes care of this for you.
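The jitter line in owping output is simply P95 minus P50 of the per-packet one-way delays. A sketch of that computation over hypothetical delay samples (the values below are made up for illustration):

```python
import statistics

# Hypothetical one-way delay samples in ms (made-up data; a real run
# would pull these from owping's per-packet records).
delays = [31.0, 31.1, 31.1, 31.2, 31.1, 31.7, 31.1, 31.0, 31.1, 31.2]

# quantiles(n=100) returns the 99 percentile cut points P1..P99.
q = statistics.quantiles(delays, n=100)
jitter = q[94] - q[49]   # P95 - P50, as reported by owping
print(f"one-way jitter = {jitter:.2f} ms (P95-P50)")
```

A tight P95-P50 spread (as in the clean owping runs on the following slides) indicates little queuing on the path; a growing spread in one direction points at congestion on that direction only.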
What OWAMP Tells Us • OWAMP is very useful in regular testing
– Congestion or queuing often occurs in a single direction – Packet loss information (and how often/how much occurs over time) is more valuable than throughput • This gives you a ‘why’ to go with an observation.
– If your router is going to drop a 50B UDP packet, it is most certainly going to drop a 1500B/9000B TCP packet
• Overlaying data – Compare your throughput results against your OWAMP results – do you see patterns? – Alarm on each, if you are alarming (and we hope you are alarming …)
OWAMP (initial)
> owping sunn-owamp.es.net
Approximately 12.6 seconds until results available

--- owping statistics from [wash-owamp.es.net]:8885 to [sunn-owamp.es.net]:8827 ---
SID:    c681fe4ed67f1b3e5faeb249f078ec8a
first:  2014-01-13T18:11:11.420
last:   2014-01-13T18:11:20.587
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 31/31.1/31.7 ms, (err=0.00201 ms)
one-way jitter = 0 ms (P95-P50)
Hops = 7 (consistently)
no reordering

--- owping statistics from [sunn-owamp.es.net]:9027 to [wash-owamp.es.net]:8888 ---
SID:    c67cfc7ed67f1b3eaab69b94f393bc46
first:  2014-01-13T18:11:11.321
last:   2014-01-13T18:11:22.672
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 31.4/31.5/32.6 ms, (err=0.00201 ms)
one-way jitter = 0 ms (P95-P50)
Hops = 7 (consistently)
no reordering

This is what perfSONAR graphs – the average of the complete test
OWAMP (w/ loss)
> owping sunn-owamp.es.net
Approximately 12.6 seconds until results available

--- owping statistics from [wash-owamp.es.net]:8852 to [sunn-owamp.es.net]:8837 ---
SID:    c681fe4ed67f1f0908224c341a2b83f3
first:  2014-01-13T18:27:22.032
last:   2014-01-13T18:27:32.904
100 sent, 12 lost (12.000%), 0 duplicates
one-way delay min/median/max = 31.1/31.1/31.3 ms, (err=0.00502 ms)
one-way jitter = nan ms (P95-P50)
Hops = 7 (consistently)
no reordering

--- owping statistics from [sunn-owamp.es.net]:9182 to [wash-owamp.es.net]:8893 ---
SID:    c67cfc7ed67f1f09531c87cf38381bb6
first:  2014-01-13T18:27:21.993
last:   2014-01-13T18:27:33.785
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 31.4/31.5/31.5 ms, (err=0.00502 ms)
one-way jitter = 0 ms (P95-P50)
Hops = 7 (consistently)
no reordering

This is what perfSONAR graphs – the average of the complete test
OWAMP (w/ re-‐ordering) > owping sunn-owamp.es.net!Approximately 12.9 seconds until results available!!--- owping statistics from [wash-owamp.es.net]:8814 to [sunn-owamp.es.net]:9062 ---!SID: !c681fe4ed67f21d94991ea335b7a1830!first:!2014-01-13T18:39:22.543!last: !2014-01-13T18:39:31.503!100 sent, 0 lost (0.000%), 0 duplicates!one-way delay min/median/max = 31.1/106/106 ms, (err=0.00201 ms)!one-way jitter = 0.1 ms (P95-P50)!Hops = 7 (consistently)!1-reordering = 19.000000%!2-reordering = 1.000000%!no 3-reordering!!--- owping statistics from [sunn-owamp.es.net]:8770 to [wash-owamp.es.net]:8939 ---!SID: !c67cfc7ed67f21d994c1302dff644543!first:!2014-01-13T18:39:22.602!last: !2014-01-13T18:39:31.279!100 sent, 0 lost (0.000%), 0 duplicates!one-way delay min/median/max = 31.4/31.5/32 ms, (err=0.00201 ms)!one-way jitter = 0 ms (P95-P50)!Hops = 7 (consistently)!no reordering!
Plotting owamp results
• Add new owamp plot here
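A rough sketch of the computation behind such a plot: owping's jitter figure is the 95th-percentile delay minus the median delay. The helper below is a hypothetical stdlib-only version using a nearest-rank percentile; the delay samples are invented for illustration:

```python
# Hypothetical helper: given per-packet one-way delays (ms) from owping,
# compute the P95-P50 jitter that owping reports, ready for plotting as
# a time series. Not perfSONAR code.
def jitter_p95_p50(delays):
    s = sorted(delays)
    def pct(p):
        # nearest-rank percentile on the sorted sample
        k = max(0, min(len(s) - 1, round(p * (len(s) - 1))))
        return s[k]
    return pct(0.95) - pct(0.50)

delays = [31.0, 31.1, 31.1, 31.2, 31.1, 31.7, 31.1, 31.1, 31.1, 31.1]
print(round(jitter_p95_p50(delays), 2))  # → 0.6
```

Feeding one such value per test interval to any plotting tool reproduces the shape of the perfSONAR latency graphs.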
BWCTL (Traceroute)

> bwtraceroute -T traceroute -s sacr-pt1.es.net -c wash-pt1.es.net
bwtraceroute: Using tool: traceroute
traceroute to 198.124.238.34 (198.124.238.34), 30 hops max, 60 byte packets
 1  sacrcr5-sacrpt1.es.net (198.129.254.37)  0.490 ms  0.788 ms  1.114 ms
 2  denvcr5-ip-a-sacrcr5.es.net (134.55.50.202)  21.304 ms  21.594 ms  21.924 ms
 3  kanscr5-ip-a-denvcr5.es.net (134.55.49.58)  31.944 ms  32.608 ms  32.838 ms
 4  chiccr5-ip-a-kanscr5.es.net (134.55.43.81)  42.904 ms  43.236 ms  43.566 ms
 5  washcr5-ip-a-chiccr5.es.net (134.55.36.46)  60.046 ms  60.339 ms  60.670 ms
 6  wash-pt1.es.net (198.124.238.34)  59.679 ms  59.693 ms  59.708 ms

> bwtraceroute -T traceroute -c sacr-pt1.es.net -s wash-pt1.es.net
bwtraceroute: Using tool: traceroute
traceroute to 198.129.254.38 (198.129.254.38), 30 hops max, 60 byte packets
 1  wash-te-perf-if1.es.net (198.124.238.33)  0.474 ms  0.816 ms  1.145 ms
 2  chiccr5-ip-a-washcr5.es.net (134.55.36.45)  19.133 ms  19.463 ms  19.786 ms
 3  kanscr5-ip-a-chiccr5.es.net (134.55.43.82)  28.515 ms  28.799 ms  29.083 ms
 4  denvcr5-ip-a-kanscr5.es.net (134.55.49.57)  39.077 ms  39.348 ms  39.628 ms
 5  sacrcr5-ip-a-denvcr5.es.net (134.55.50.201)  60.013 ms  60.299 ms  60.983 ms
 6  sacr-pt1.es.net (198.129.254.38)  59.679 ms  59.678 ms  59.668 ms
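One thing worth checking in a pair of traces like the two above is that the reverse path mirrors the forward path; asymmetric routing often explains puzzling one-way results. A toy check, hypothetical and using simplified router names taken from the slide's output:

```python
# Illustrative sketch (not a perfSONAR or bwctl feature): compare the
# transit routers of forward and reverse bwtraceroute runs.
forward = ["sacrcr5", "denvcr5", "kanscr5", "chiccr5", "washcr5", "wash-pt1"]
reverse = ["washcr5", "chiccr5", "kanscr5", "denvcr5", "sacrcr5", "sacr-pt1"]

def symmetric(fwd, rev):
    # Drop the end hosts, then check the reverse transit path is the
    # mirror image of the forward one.
    return fwd[:-1] == list(reversed(rev[:-1]))

print(symmetric(forward, reverse))  # → True
```

If this returns False, per-direction results (loss, latency) should be interpreted against the path each direction actually took.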
BWCTL (Tracepath)

> bwtraceroute -T tracepath -s sacr-pt1.es.net -c wash-pt1.es.net
bwtraceroute: Using tool: tracepath
bwtraceroute: 36 seconds until test results available

SENDER START
 1?: [LOCALHOST]                               pmtu 9000
 1:  sacrcr5-sacrpt1.es.net (198.129.254.37)   0.489ms
 1:  sacrcr5-sacrpt1.es.net (198.129.254.37)   0.463ms
 2:  denvcr5-ip-a-sacrcr5.es.net (134.55.50.202)  21.426ms
 3:  kanscr5-ip-a-denvcr5.es.net (134.55.49.58)   31.957ms
 4:  chiccr5-ip-a-kanscr5.es.net (134.55.43.81)   42.947ms
 5:  washcr5-ip-a-chiccr5.es.net (134.55.36.46)   60.092ms
 6:  wash-pt1.es.net (198.124.238.34)             59.753ms reached
     Resume: pmtu 9000 hops 6 back 59
SENDER END
BWCTL (Tracepath with MTU mismatch)

> bwtraceroute -T tracepath -c nettest.lbl.gov -s anl-pt1.es.net
bwtraceroute: Using tool: tracepath
 1?: [LOCALHOST]                               pmtu 9000
 1:  anlmr2-anlpt1.es.net (198.124.252.118)    0.249ms asymm 2
 1:  anlmr2-anlpt1.es.net (198.124.252.118)    0.197ms asymm 2
 2:  no reply
 3:  kanscr5-ip-a-chiccr5.es.net (134.55.43.82)   13.816ms
 4:  denvcr5-ip-a-kanscr5.es.net (134.55.49.57)   24.379ms
 5:  sacrcr5-ip-a-denvcr5.es.net (134.55.50.201)  45.298ms
 6:  sunncr5-ip-a-sacrcr5.es.net (134.55.40.6)    47.890ms
 7:  et-3-0-0-1411.er1-n1.lbl.gov (198.129.78.22) 50.093ms
 8:  t5-4.ir1-n1.lbl.gov (131.243.244.131)        50.772ms
 9:  t5-4.ir1-n1.lbl.gov (131.243.244.131)        52.669ms pmtu 1500
 9:  nettest.lbl.gov (131.243.24.11)              49.239ms reached
     Resume: pmtu 1500 hops 9 back 56
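The MTU mismatch above is easy to spot by eye; a small script can flag it automatically when tracepath runs are scheduled. A sketch (not a bwctl feature) that assumes tracepath's `pmtu N` markers, with an abbreviated sample of the output above embedded:

```python
# Illustrative sketch: scan tracepath output for a mid-path MTU drop,
# as in the slide where the path MTU falls from 9000 to 1500.
import re

SAMPLE = """\
 1?: [LOCALHOST]            pmtu 9000
 9:  t5-4.ir1-n1.lbl.gov    52.669ms pmtu 1500
     Resume: pmtu 1500 hops 9
"""

def mtu_changes(text):
    """Return the sequence of pmtu values seen along the path."""
    return [int(m) for m in re.findall(r"pmtu (\d+)", text)]

print(mtu_changes(SAMPLE))  # → [9000, 1500, 1500]
```

Any sequence that is not constant indicates a jumbo-frame host talking across a 1500-byte segment, a common cause of poor (or zero) throughput.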
BWCTL (Ping)

> bwping -c nettest.lbl.gov -s anl-pt1.es.net
bwping: Using tool: ping
PING 131.243.24.11 (131.243.24.11) from 198.124.252.117 : 56(84) bytes of data.
64 bytes from 131.243.24.11: icmp_seq=1 ttl=56 time=49.1 ms
64 bytes from 131.243.24.11: icmp_seq=2 ttl=56 time=49.1 ms
64 bytes from 131.243.24.11: icmp_seq=3 ttl=56 time=49.1 ms
64 bytes from 131.243.24.11: icmp_seq=4 ttl=56 time=49.1 ms
64 bytes from 131.243.24.11: icmp_seq=5 ttl=56 time=49.1 ms

To test to a host not running bwctl, use "-E":

> bwping -E -c www.google.com
bwping: Using tool: ping
PING 2607:f8b0:4010:800::1013(2607:f8b0:4010:800::1013) from 2001:400:2201:1190::3 : 56 data bytes
64 bytes from 2607:f8b0:4010:800::1013: icmp_seq=1 ttl=54 time=48.1 ms
64 bytes from 2607:f8b0:4010:800::1013: icmp_seq=2 ttl=54 time=48.2 ms
64 bytes from 2607:f8b0:4010:800::1013: icmp_seq=3 ttl=54 time=48.2 ms
64 bytes from 2607:f8b0:4010:800::1013: icmp_seq=4 ttl=54 time=48.2 ms
BWCTL (owamp)

> bwping -T owamp -4 -s sacr-pt1.es.net -c wash-pt1.es.net
bwping: Using tool: owamp
bwping: 42 seconds until test results available

--- owping statistics from [198.129.254.38]:5283 to [198.124.238.34]:5121 ---
SID:    c67cee22d85fc3b2bbe23f83da5947b2
first:  2015-01-13T08:17:58.534
last:   2015-01-13T08:18:17.581
10 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 29.9/29.9/29.9 ms, (err=0.191 ms)
one-way jitter = 0.1 ms (P95-P50)
Hops = 5 (consistently)
no reordering
Testing Pitfalls

• Don't trust a single result
  – Run any test a few times to confirm the results
  – Hop-by-hop path characteristics may be continuously changing
• A poor carpenter blames his tools
  – The tools are only as good as the people using them; be methodical
  – Trust the results; remember that they reflect the entire environment
• If the site isn't using perfSONAR, step 1 is to get them to do so
  – http://www.perfsonar.net
• Get some help from the community
  – perfsonar-user@internet2.edu
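The first bullet above ("don't trust a single result") can be automated. A hypothetical wrapper that summarizes repeated runs so a single outlier doesn't drive the conclusion; the throughput samples below are invented for illustration:

```python
# Illustrative sketch: summarize repeated measurement runs. A large
# spread relative to the median says "rerun before concluding anything".
import statistics

def summarize(runs):
    """runs: throughput samples in Gbps from repeated tests."""
    med = statistics.median(runs)
    spread = max(runs) - min(runs)
    return med, spread

samples = [9.4, 9.3, 2.1, 9.4]   # one anomalous run among four
med, spread = summarize(samples)
print(med, spread)               # spread >> 0 here, so don't trust run #3 alone
```

The median is deliberately used instead of the mean: with a small number of runs, one anomalous measurement should not move the headline number.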
Resources

• perfSONAR website – http://www.perfsonar.net/
• perfSONAR Toolkit Manual – http://docs.perfsonar.net/
• perfSONAR mailing lists – http://www.perfsonar.net/about/getting-help/
• perfSONAR directory – http://stats.es.net/ServicesDirectory/
• FasterData Knowledgebase – http://fasterdata.es.net/
EXTRA SLIDES