System Troubleshooting TCS Network, System, and Load Monitoring TCS for Developers.



Page 2: LBT TCS Cluster

Page 3: Networking

VLANs for private networks
6 Gb non-blocking, full-duplex backbone
Latency, Throughput, Data Rate
Broadcast, Multicast, TCP/UDP
Bottleneck at the desktop workstations

Page 4: Diagnostics Theory

Memory bound versus CPU bound
Network throughput versus speed
Multithreading errors
Subsystem interaction
printf and syslog
Standard Out and Standard Error

Page 5: Monitoring and Diagnostic Tools

/sbin/tcpdump
/sbin/ifconfig
cacti
top
syslog
vmstat
R
gnuplot

Page 6: tcpdump

Interactive
tcpdump -lett -i <device> {limit}
Device can be eth0, or eth0.20 for VLANs

Gather Only
tcpdump -i <device> -w <file>
Gathers all raw packets and writes them to a file for later processing
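As one way to do the "processing later" step, the sketch below counts the packets and captured bytes in a file written with -w. It assumes the file is in the classic pcap format with microsecond timestamps (a 24-byte global header followed by 16-byte record headers); the function name summarize_pcap and the file name in the usage note are illustrative, not part of the TCS tooling.

```python
import struct

# Magic number 0xa1b2c3d4 as it appears on disk in each byte order.
PCAP_MAGIC_LE = b"\xd4\xc3\xb2\xa1"
PCAP_MAGIC_BE = b"\xa1\xb2\xc3\xd4"


def summarize_pcap(stream):
    """Return (packet_count, captured_bytes) for a classic pcap byte stream."""
    header = stream.read(24)
    if len(header) < 24:
        raise ValueError("truncated pcap global header")
    magic = header[:4]
    if magic == PCAP_MAGIC_LE:
        endian = "<"
    elif magic == PCAP_MAGIC_BE:
        endian = ">"
    else:
        raise ValueError("unsupported capture format (not classic pcap)")

    count = 0
    total = 0
    while True:
        record = stream.read(16)
        if len(record) < 16:
            break  # end of file
        # Record header: seconds, microseconds, captured length, original length.
        _sec, _usec, incl_len, _orig_len = struct.unpack(endian + "IIII", record)
        data = stream.read(incl_len)
        if len(data) < incl_len:
            break  # final record was cut short; ignore it
        count += 1
        total += incl_len
    return count, total
```

A typical call would be `with open("capture.pcap", "rb") as f: print(summarize_pcap(f))`, where capture.pcap stands in for whatever was passed to -w.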

Page 7: Reflective Memory

[root@lbtmu107 ~]# tcpdump -i eth0
17:51:34.494273 IP 10.10.0.238.5000 > 10.10.0.255.5000: UDP, length 1028
17:51:34.494282 IP 10.10.0.238.5000 > 10.10.0.255.5000: UDP, length 60
17:51:34.494397 IP 10.10.0.239.5000 > 10.10.0.255.5000: UDP, length 60
17:51:34.494522 IP 10.10.0.240.5000 > 10.10.0.255.5000: UDP, length 60
17:51:34.494531 IP 10.10.0.241.5000 > 10.10.0.255.5000: UDP, length 60
17:51:34.504062 IP 10.10.0.245.5000 > 10.10.0.255.5000: UDP, length 60
17:51:34.504144 IP 10.10.0.248.5000 > 10.10.0.255.5000: UDP, length 60
17:51:34.504266 IP 10.10.0.238.5000 > 10.10.0.255.5000: UDP, length 1028
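The capture on this slide shows several hosts broadcasting UDP to 10.10.0.255 port 5000. A rough sketch of how such text output could be tallied per sender is given below; the regular expression only covers the UDP line shape shown here, and the name traffic_by_source is made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as:
# 17:51:34.494273 IP 10.10.0.238.5000 > 10.10.0.255.5000: UDP, length 1028
LINE = re.compile(
    r"^(\d\d:\d\d:\d\d\.\d+) IP (\S+)\.(\d+) > (\S+)\.(\d+): UDP, length (\d+)$"
)


def traffic_by_source(lines):
    """Return {source_host: (packet_count, total_udp_payload_bytes)}."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # skip prompts, blank lines, and non-UDP traffic
        _time, src_host, _sport, _dst_host, _dport, length = match.groups()
        totals[src_host][0] += 1
        totals[src_host][1] += int(length)
    return {host: tuple(pair) for host, pair in totals.items()}
```

Feeding it the eight lines above would attribute three packets to 10.10.0.238 and one to each of the other senders, which makes an unusually chatty host easy to spot.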

Page 8: ifconfig

[root@lbtmu01 ~]# ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:11:11:10:04:10
          inet6 addr: fe80::211:11ff:fe10:410/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:402698793 errors:0 dropped:0 overruns:0 frame:0
          TX packets:74367255 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3500999197 (3.2 GiB)  TX bytes:3982146708 (3.7 GiB)
          Base address:0xdf40 Memory:fbee0000-fbf00000

eth0.10   Link encap:Ethernet  HWaddr 00:11:11:10:04:10
          inet addr:10.144.0.131  Bcast:10.144.0.255  Mask:255.255.255.0
          inet6 addr: fe80::211:11ff:fe10:410/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:12609308 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9774513 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2701235204 (2.5 GiB)  TX bytes:1087406483 (1.0 GiB)
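The error, dropped, and overrun counters in this output are usually the first thing to check. The sketch below pulls them out per interface, assuming the older net-tools layout shown on this slide with interfaces separated by blank lines; parse_ifconfig and problem_interfaces are hypothetical helper names.

```python
import re

# Matches the "RX packets:... errors:... dropped:... overruns:..." counters.
COUNTERS = re.compile(
    r"(RX|TX) packets:(\d+) errors:(\d+) dropped:(\d+) overruns:(\d+)"
)


def parse_ifconfig(text):
    """Map interface name -> {'RX': {...}, 'TX': {...}} counter dictionaries."""
    result = {}
    for block in re.split(r"\n\s*\n", text.strip()):
        if not block.strip():
            continue
        name = block.split(None, 1)[0]  # first token of the block
        stats = {}
        for direction, packets, errors, dropped, overruns in COUNTERS.findall(block):
            stats[direction] = {
                "packets": int(packets),
                "errors": int(errors),
                "dropped": int(dropped),
                "overruns": int(overruns),
            }
        result[name] = stats
    return result


def problem_interfaces(parsed):
    """List interfaces with any nonzero error, dropped, or overrun counter."""
    return [
        name
        for name, stats in parsed.items()
        if any(s["errors"] or s["dropped"] or s["overruns"] for s in stats.values())
    ]
```

For the output above, problem_interfaces would return an empty list, since every counter on eth0 and eth0.10 is zero.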

Page 9: Cacti (http://ldap.lbto.arizona.edu/cacti/)

www.cacti.net
LDAP authentication
Customizable views
Full deployment: September 2006

Page 10: top

Time lost in system is probably I/O, which includes networking
Sort by memory usage with “M”
top reports its own resource usage inaccurately

Page 11: vmstat

vmstat is a Linux utility for monitoring virtual memory usage. It can also be used to track down I/O problems, including networking.

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0 626164 533248  12488  64388    1    2     6     5   44    44  9  3 88  0
 0  0 626164 533136  12488  64388    0    0     0     0 1613  1161  5  2 93  0
 0  0 626164 533136  12496  64388    0    0     0    12 1642  1189  5  3 92  0
 0  0 626164 533136  12496  64388    0    0     0     0 1645  1247  4  2 94  0
 0  0 626164 533128  12496  64388    0    0     0     0 1640  1195  5  3 92  0
 0  0 626164 533128  12496  64388    0    0     0     0 1631  1248  4  2 93  0
 1  0 626164 533200  12496  64388    0    0     0     0 1674  1288  5  3 92  0
 0  0 626164 533200  12496  64388    0    0     0     1 1622  1210  4  2 94  0
 0  0 626164 533200  12500  64388    0    0     0    17 1705  1312  6  3 91  0
 0  0 626164 533200  12500  64388    0    0     0     0 1649  1261  5  3 93  0
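To turn vmstat output like this into numbers that can be compared between runs, a small parser is often enough. The sketch below is one possible approach: it reads the column names from the header row and averages the CPU columns. The first data row is left out of the average because vmstat's first report gives averages since boot rather than a live sample. The names parse_vmstat and mean_columns are illustrative.

```python
SAMPLE = """
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0 626164 533248  12488  64388    1    2     6     5   44    44  9  3 88  0
 0  0 626164 533136  12488  64388    0    0     0     0 1613  1161  5  2 93  0
 0  0 626164 533136  12496  64388    0    0     0    12 1642  1189  5  3 92  0
"""


def parse_vmstat(text):
    """Return one dict per data row, keyed by the names in the header row."""
    columns = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":
            columns = fields  # the "r b swpd free ..." header row
        elif columns and len(fields) == len(columns) and all(
            f.isdigit() for f in fields
        ):
            rows.append(dict(zip(columns, map(int, fields))))
    return rows


def mean_columns(rows, names=("us", "sy", "id", "wa")):
    """Average the named columns over the given rows."""
    return {name: sum(row[name] for row in rows) / len(rows) for name in names}


rows = parse_vmstat(SAMPLE)
live = rows[1:]  # drop the since-boot averages in the first report
print(mean_columns(live))
```

A high sy or wa average relative to us would point toward the system or I/O side, in line with the note on the top slide that time lost in system is probably I/O.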

Page 12: Statistical Analysis

R, gnuplot, and Matlab

All of these packages give you a different view of the data that you gather.

Even if you are not comfortable with them, someone else might be.

Graphs, Charts, baselines, etc…
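As a small illustration of the baseline idea, here in Python rather than R, gnuplot, or Matlab, the sketch below flags samples that fall far from a baseline mean. The baseline values are the live interrupt counts (the "in" column) from the vmstat slide; the 3-sigma threshold, the function name flag_outliers, and the new sample values are assumptions chosen for the example.

```python
import statistics


def flag_outliers(baseline, samples, k=3.0):
    """Return the samples more than k standard deviations from the baseline mean."""
    mean = statistics.mean(baseline)
    spread = statistics.stdev(baseline)  # needs at least two baseline points
    return [x for x in samples if abs(x - mean) > k * spread]


# Interrupts per second taken from the vmstat slide (first report excluded).
baseline_in = [1613, 1642, 1645, 1640, 1631, 1674, 1622, 1705, 1649]

# Hypothetical new readings: one ordinary, one far outside the baseline.
print(flag_outliers(baseline_in, [1650, 5000]))
```

The same comparison could be drawn as a plot with a shaded baseline band in any of the packages named above.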

Page 13: Syslog

/var/log/TCS/?

[telescope@lbtmu01 ~]$ tail -f /var/log/TCS/user
Jul 24 20:55:19 lbtmu105 LBT_ECS: Thermal failed to connect to IP 10.144.0.205 port 50010
Jul 24 20:55:20 lbtmu105 LBT_ECS: Thermal not connected to ThermalBox, Send Cmd failed
Jul 24 20:55:32 lbtmu105 LBT_ECS: Thermal failed to connect to IP 10.144.0.205 port 50010
Jul 24 20:55:33 lbtmu105 LBT_ECS: Thermal not connected to ThermalBox, Send Cmd failed
Jul 24 20:55:43 lbtmu103 last message repeated 58 times
Jul 24 20:55:45 lbtmu105 LBT_ECS: Thermal failed to connect to IP 10.144.0.205 port 50010
Jul 24 20:55:46 lbtmu105 LBT_ECS: Thermal not connected to ThermalBox, Send Cmd failed
Jul 24 20:55:58 lbtmu105 LBT_ECS: Thermal failed to connect to IP 10.144.0.205 port 50010
Jul 24 20:55:59 lbtmu105 LBT_ECS: Thermal not connected to ThermalBox, Send Cmd failed
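When a log repeats the same failures, as this one does, a per-host tally can be quicker to read than the raw tail. The sketch below counts identical messages per host and handles "last message repeated N times" lines by crediting them to that host's previous message when one has been seen. It assumes the traditional "Mon DD HH:MM:SS host message" layout shown above; the function name summarize_syslog is illustrative.

```python
import re
from collections import Counter

# Traditional syslog line: "Jul 24 20:55:19 host message text"
LINE = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+(.*)$")
REPEAT = re.compile(r"^last message repeated (\d+) times$")


def summarize_syslog(lines):
    """Return (counts, unattributed).

    counts maps (host, message) to the number of occurrences, including
    repeats credited to the host's previous message. unattributed maps a
    host to repeat counts whose original message was not in the input.
    """
    counts = Counter()
    unattributed = Counter()
    last_by_host = {}
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue
        _stamp, host, message = match.groups()
        repeat = REPEAT.match(message)
        if repeat:
            n = int(repeat.group(1))
            if host in last_by_host:
                counts[(host, last_by_host[host])] += n
            else:
                unattributed[host] += n
            continue
        counts[(host, message)] += 1
        last_by_host[host] = message
    return counts, unattributed
```

On the excerpt above this would report each of the two lbtmu105 messages four times, plus 58 repeats on lbtmu103 whose original message is not visible in the excerpt.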