Debugging Distributed Systems - Velocity Santa Clara 2016
2016-06-23
Debugging Distributed Systems
Donny Nadolny
#VelocityConf
What is ZooKeeper?
• Distributed system for building distributed systems
• Small in-memory filesystem
ZooKeeper at PagerDuty
• Distributed locking
• Consistent, highly available
ZooKeeper at PagerDuty… Over A WAN
• Distributed locking
• Consistent, highly available
[Diagram: ZK 1 (DC-A), ZK 2 (DC-B), ZK 3 (DC-C); inter-DC latencies of 3 ms and 24 ms]
The Failure
• Network trouble, one follower falls behind
• ZooKeeper gets stuck - leader still up
[Graph: DB size over time, with two annotated points]
Fault Injection
• Disk slow? let’s test: sshfs donny@some_server:/home/donny /mnt
• Similar failure profile
• Re-examine disk latency… nope, was a red herring
Health Checks
• First warning: application monitoring
• Monitoring ZooKeeper: used ruok
Deep Health Checks
• Added deep health check:
  • write to one ZooKeeper key
  • read from ZooKeeper key
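The write-then-read probe above can be sketched as follows. This is a hypothetical sketch, not PagerDuty's actual check: KeyValueStore is a stand-in interface for a ZooKeeper client session, and InMemoryStore is a stub used only so the example is self-contained.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a deep health check: write a unique value to one key, read it
// back, and report healthy only if the full round trip succeeds.
public class DeepHealthCheck {
    interface KeyValueStore {
        void write(String key, String value) throws Exception;
        String read(String key) throws Exception;
    }

    // Stub standing in for a real ZooKeeper session (illustration only).
    static class InMemoryStore implements KeyValueStore {
        private final Map<String, String> data = new ConcurrentHashMap<>();
        public void write(String key, String value) { data.put(key, value); }
        public String read(String key) { return data.get(key); }
    }

    static boolean isHealthy(KeyValueStore store) {
        try {
            String token = "check-" + System.nanoTime();  // unique per probe
            store.write("/health/probe", token);
            return token.equals(store.read("/health/probe"));
        } catch (Exception e) {
            return false;  // any failure (timeout, connection loss) = unhealthy
        }
    }

    public static void main(String[] args) {
        System.out.println(isHealthy(new InMemoryStore()));
    }
}
```

Unlike ruok, a probe like this exercises the whole write path (leader, quorum, and the follower you are connected to), which is the point of a deep check.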
"LearnerHandler-/123.45.67.89:45874" prio=10 tid=0x00000000024bb800 nid=0x3d0d runnable [0x00007fe6c3193000]
java.lang.Thread.State: RUNNABLE at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
…
at org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:118) …
at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123)
at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1115) - locked <0x00000000d4cd9e28> (a org.apache.zookeeper.server.DataNode) at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1130)
…
at org.apache.zookeeper.server.ZKDatabase.serializeSnapshot(ZKDatabase.java:467) at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:493)
The Stack Trace
1
2
3
2016−06−23DEBUGGING DISTRIBUTED SYSTEMS
Threads (Leader)
• Request processors
• Learner handler (one per follower)
• Client requests
Write Snapshot Code (simplified)

void serializeNode(OutputArchive output, String path) {
    DataNode node = getNode(path);
    String[] children = {};
    synchronized (node) {
        output.writeString(path, "path");
        output.writeRecord(node, "node");  // blocking network write
        children = node.getChildren();
    }
    for (String child : children) {
        serializeNode(output, path + "/" + child);
    }
}
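The "don't lock and block" lesson suggests one way this could have been structured: copy the node's bytes while holding its lock, and do the potentially blocking network write only after releasing it. A minimal, self-contained sketch of that pattern (plain Java types, not ZooKeeper's actual classes or its actual fix):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Sketch: serialize under the lock into memory, write outside the lock,
// so a slow peer cannot stall other threads waiting on the node's monitor.
public class SnapshotSketch {
    static class Node {
        byte[] data;
        List<String> children = new ArrayList<>();
        Node(byte[] data) { this.data = data; }
    }

    static void writeNode(OutputStream out, Node node) throws IOException {
        byte[] copy;
        synchronized (node) {
            copy = node.data.clone();   // fast, in-memory work only
        }
        out.write(copy);                // may block on a slow peer; no lock held
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeNode(out, new Node(new byte[] {1, 2, 3}));
        System.out.println(out.size()); // 3
    }
}
```

The trade-off is extra memory for the copy; the benefit is that the lock hold time no longer depends on network behaviour.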
ZooKeeper Heartbeat
• Why didn’t a follower take over?
• ZK heartbeat: message from leader to follower; follower times out
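The follower-side timeout can be sketched as below. This is a hypothetical, simplified model (HeartbeatMonitor is not ZooKeeper's implementation, and the timeout value is illustrative): the follower records when it last heard from the leader, and gives up on the leader once the silence exceeds the timeout.

```java
// Sketch: follower-side heartbeat timeout (hypothetical, not ZooKeeper code).
public class HeartbeatMonitor {
    private final long timeoutMs;
    private long lastHeardMs;

    HeartbeatMonitor(long timeoutMs, long nowMs) {
        this.timeoutMs = timeoutMs;
        this.lastHeardMs = nowMs;
    }

    // Called whenever any message arrives from the leader.
    void onMessageFromLeader(long nowMs) { lastHeardMs = nowMs; }

    // If the leader has been silent longer than the timeout, the follower
    // should give up on it and trigger a new leader election.
    boolean leaderTimedOut(long nowMs) { return nowMs - lastHeardMs > timeoutMs; }

    public static void main(String[] args) {
        HeartbeatMonitor m = new HeartbeatMonitor(10_000, 0);  // 10 s timeout
        m.onMessageFromLeader(5_000);
        System.out.println(m.leaderTimedOut(12_000));  // false: heard at t=5s
        System.out.println(m.leaderTimedOut(16_000));  // true: silent > 10s
    }
}
```

The failure in this talk is the leader-side mirror image: the leader's heartbeats kept flowing, so no follower ever timed out, even though the snapshot transfer was stuck.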
Threads (Leader)
• Request processors
• Learner handler (one per follower)
• Client requests
• Quorum Peer
TCP Data Transmission
• Connection setup: SYN, SYN-ACK, ACK; both sides ESTABLISHED
• Normal case: Packet 1 is sent, the receiver replies with an ACK
• If no ACK arrives, Packet 1 is retransmitted: after ~200ms, then ~200ms, ~400ms, ~800ms, … doubling up to ~120sec per attempt
• After 15 retries the sender finally gives up and the connection is CLOSED
TCP Retransmission (Linux Defaults)
• Retransmission timeout (RTO) is based on latency
• TCP_RTO_MIN = 200 ms
• TCP_RTO_MAX = 2 minutes
• /proc/sys/net/ipv4/tcp_retries2 = 15 retries
• 0.2 + 0.2 + 0.4 + 0.8 + … + 120 = 924.8 seconds (~15.5 mins)
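The arithmetic above can be checked with a short program. This is a simplified model (the first wait equals TCP_RTO_MIN, each retry doubles it, capped at TCP_RTO_MAX); real kernels derive RTO from measured RTT, so this model's total differs slightly from the slide's 924.8 s, but both land around 15½ minutes.

```java
public class RtoSum {
    // Worst-case time before TCP gives up, under a simplified backoff model.
    static double worstCaseSeconds(double rtoMin, double rtoMax, int retries) {
        double rto = rtoMin, total = 0;
        for (int i = 0; i <= retries; i++) {  // initial wait + each retry's wait
            total += rto;
            rto = Math.min(rto * 2, rtoMax);  // exponential backoff, capped
        }
        return total;
    }

    public static void main(String[] args) {
        double t = worstCaseSeconds(0.2, 120, 15);  // Linux defaults
        System.out.printf("~%.1f seconds (~%.1f minutes)%n", t, t / 60);  // ~924.6 s
    }
}
```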
Timeline
1. Network trouble begins - packet loss / latency
2. Follower falls behind, restarts, requests snapshot
3. Leader begins to send snapshot
4. Snapshot transfer stalls
5. Follower ZooKeeper restarts, attempts to close connection
6. Network heals
7. … Leader still stuck
TCP Close Connection
• Normal case: follower sends FIN (FIN_WAIT1); leader replies FIN/ACK (LAST_ACK); follower sends the final ACK, waits 60 seconds in TIME_WAIT, and both sides end up CLOSED
• With packets being dropped: the follower retransmits its FIN 8 times, then gives up and moves to CLOSED after ~1m40s; the leader never sees the FIN
• Meanwhile the leader, still ESTABLISHED, keeps retransmitting Packet 1 for ~15.5 mins
• If a retransmitted packet does reach the follower after it has closed, the follower replies with RST and the leader closes immediately
syslog - Dropped Packets on Follower
• 06:51:47 iptables: WARN: IN=eth0 OUT= MAC=00:0d:12:34:56:78:12:34:56:78:12:34:56:78 SRC=<leader_ip> DST=<follower_ip> LEN=54 TOS=0x00 PREC=0x00 TTL=44 ID=36370 DF PROTO=TCP SPT=3888 DPT=36416 WINDOW=227 RES=0x00 ACK PSH URGP=0
iptables

iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
... more rules to accept connections ...
iptables -A INPUT -j DROP

But: iptables connections != netstat connections
conntrack Timeouts
• From linux/net/netfilter/nf_conntrack_proto_tcp.c:
• [TCP_CONNTRACK_LAST_ACK] = 30 SECS
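The race between the FIN retransmission backoff and this 30-second timeout can be sketched numerically. This is an illustrative model only: it assumes the FIN backoff starts at 0.2 s and doubles, and that each retransmitted FIN refreshes the conntrack entry; once the gap until the next FIN exceeds 30 s, the entry expires before that FIN arrives.

```java
public class ConntrackRace {
    // Time (s) at which the conntrack LAST_ACK entry expires, assuming each
    // retransmitted FIN refreshes its 30 s timeout. When the next backoff gap
    // exceeds the timeout, the entry expires before the next FIN is sent.
    static double entryExpirySeconds(double rtoMin, int finRetries, double conntrackTimeout) {
        double rto = rtoMin, t = 0;
        for (int i = 0; i < finRetries; i++) {
            if (rto > conntrackTimeout) break;  // next FIN would come too late
            t += rto;    // FIN retransmitted at time t; entry refreshed
            rto *= 2;    // exponential backoff
        }
        return t + conntrackTimeout;
    }

    public static void main(String[] args) {
        // 8 FIN retransmissions (as in the close-connection slide), 30 s timeout:
        System.out.printf("~%.1f s%n", entryExpirySeconds(0.2, 8, 30));  // ~81.0 s
    }
}
```

So roughly 80 seconds after the follower starts closing, conntrack has forgotten the connection, and any later packet from the leader hits the final DROP rule instead of producing an RST.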
TCP Close Connection
• Follower's kernel TCP retransmits the FIN with doubling backoff (~12.8s, ~25.6s, ~51.2s, …); each FIN refreshes the conntrack LAST_ACK entry (30s timeout)
• Once the gap between FINs exceeds 30s, the conntrack entry expires (~81.2s), before kernel TCP gives up and reaches CLOSED (~102.4s)
The Full Story
• Packet loss
• Follower falls behind, requests snapshot
• (Packet loss continues) follower closes connection
• Follower conntrack forgets connection
• When the leader's retransmissions arrive, they are dropped by the follower's firewall (no conntrack entry), so no RST is ever sent
• Leader now stuck for ~15 mins, even if network heals
Reproducing (1/3) - Setup
• Follower falls behind: tc qdisc add dev eth0 root netem delay 500ms 100ms loss 35%
• Wait for a few minutes
• (Alternate: kill the follower)
Reproducing (2/3) - Request Snapshot
• Remove latency / packet loss: tc qdisc del dev eth0 root netem
• Restrict bandwidth:
  tc qdisc add dev eth0 handle 1: root htb default 11
  tc class add dev eth0 parent 1: classid 1:1 htb rate 100kbps
  tc class add dev eth0 parent 1:1 classid 1:11 htb rate 100kbps
• Restart follower ZooKeeper process
Reproducing (3/3) - Carefully Close Connection
• Block traffic to leader: iptables -A OUTPUT -p tcp -d <leader ip> -j DROP
• Remove bandwidth restriction: tc qdisc del dev eth0 root
• Kill follower ZooKeeper process; kernel tries to close connection
• Monitor conntrack status, wait for entry to disappear (~80 seconds): conntrack -L | grep <leader ip>
• Allow traffic to leader: iptables -D OUTPUT -p tcp -d <leader ip> -j DROP
IPsec
[Diagram: TCP data between follower and leader travels encapsulated as IPsec ESP (UDP)]

IPsec - Establish Connection
[Diagram: IPsec Phase 1, then Phase 2 negotiation; then TCP data flows]

IPsec - Dropped Packets
[Diagram: TCP data sent while the tunnel is down is dropped]

IPsec - Heartbeat
[Diagram: an IPsec heartbeat detects the dead tunnel so it can be re-established and TCP data flows again]
Lesson 1
• Don’t lock and block
• TCP can block for a really long time
• Interfaces / abstract methods make analysis harder
Lesson 2
• Automate debug info collection (stack trace, heap dump, txn logs, etc)
Lesson 3
• Application/dependency checks should be deep health checks!
• Leader/follower heartbeats should be deep health checks!
Questions? [email protected]
Link: “Network issues can cause cluster to hang due to near-deadlock” https://issues.apache.org/jira/browse/ZOOKEEPER-2201
“Mess With The Network” Cheat Sheet

tc qdisc add dev eth0 root netem delay 500ms 100ms loss 25%   # add latency
tc qdisc del dev eth0 root netem                              # remove latency

tc qdisc add dev eth0 handle 1: root htb default 11           # restrict bandwidth
tc class add dev eth0 parent 1: classid 1:1 htb rate 100kbps
tc class add dev eth0 parent 1:1 classid 1:11 htb rate 100kbps
tc qdisc del dev eth0 root                                    # remove bandwidth restriction
# tip: when adding latency / loss / bandwidth restriction, it can be helpful to run
# "sleep 60 && <tc delete command> & disown" in case you lose ssh access

tcpdump -n "src host 123.45.67.89 or dst host 123.45.67.89" -i eth0 -s 65535 -w /tmp/packet.dump
# capture packets, then open locally in wireshark

iptables -A OUTPUT -p tcp --dport 4444 -j DROP                # block traffic
iptables -D OUTPUT -p tcp --dport 4444 -j DROP                # allow traffic
# use the INPUT / OUTPUT chain for incoming / outgoing traffic; also
# --dport <dest port>, --sport <src port>, -s <source ip>, -d <dest ip>

sshfs [email protected]:/tmp/data /mnt
# configure the database/application local data directory to be /mnt,
# then use the tools above against 123.45.67.89
# alternative: nbd (network block device)

netstat -peanut    # network connections, regular kernel view
conntrack -L       # network connections, iptables view