Debugging Distributed Systems - Velocity Santa Clara 2016

Transcript of Debugging Distributed Systems - Velocity Santa Clara 2016
2016-06-23

Donny Nadolny
[email protected]
#VelocityConf

What is ZooKeeper
• Distributed system for building distributed systems
• Small in-memory filesystem

ZooKeeper at PagerDuty
• Distributed locking (sketch below)
• Consistent, highly available
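For readers unfamiliar with the pattern, here is a minimal sketch of what ZooKeeper-based distributed locking can look like. It is not PagerDuty's implementation: the lock path is a made-up example, it uses a single ephemeral znode (so every waiter wakes up on release), and a real deployment would normally use a recipe library such as Apache Curator.

// Minimal sketch of ZooKeeper-based locking, NOT PagerDuty's implementation.
// Single ephemeral znode, so every waiter wakes on release (herd effect).
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleZkLock {
    private final ZooKeeper zk;
    private final String path;  // e.g. "/locks/my-lock" (hypothetical path)

    public SimpleZkLock(ZooKeeper zk, String path) {
        this.zk = zk;
        this.path = path;
    }

    public void acquire() throws KeeperException, InterruptedException {
        while (true) {
            try {
                // Ephemeral: the lock is released automatically if our session dies.
                zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return;  // we hold the lock
            } catch (KeeperException.NodeExistsException e) {
                // Someone else holds it; wait until their znode disappears, then retry.
                CountDownLatch released = new CountDownLatch(1);
                if (zk.exists(path, event -> released.countDown()) != null) {
                    released.await();
                }
            }
        }
    }

    public void release() throws KeeperException, InterruptedException {
        zk.delete(path, -1);
    }
}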

ZooKeeper at PagerDuty… Over A WAN
• Distributed locking
• Consistent, highly available

[Diagram: three ZooKeeper nodes (ZK 1, ZK 2, ZK 3) in three data centers (DC-A, DC-B, DC-C), connected by WAN links of roughly 24 ms, 24 ms, and 3 ms.]

The Failure
• Network trouble, one follower falls behind
• ZooKeeper gets stuck - leader still up

[Chart: DB size over time during the incident.]


First Hint
• Leader logs: “Too busy to snap, skipping”

Fault Injection
• Disk slow? let’s test:
  sshfs donny@some_server:/home/donny /mnt
• Similar failure profile
• Re-examine disk latency… nope, was a red herring

Health Checks
• First warning: application monitoring
• Monitoring ZooKeeper: used ruok (sketch below)
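ruok is one of ZooKeeper's "four letter word" commands: send the literal string ruok to the client port and a live server answers imok. A minimal sketch of such a probe is below (host, port, and timeouts are illustrative); the catch, which the next slide addresses, is that this only proves the server process is responsive.

// Shallow ZooKeeper health check: send the four-letter word "ruok", expect "imok".
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class RuokCheck {
    public static boolean ruok(String host, int port) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 2000);
            socket.setSoTimeout(2000);
            OutputStream out = socket.getOutputStream();
            out.write("ruok".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            socket.shutdownOutput();                       // nothing more to send
            InputStream in = socket.getInputStream();
            byte[] buf = new byte[4];
            int n = 0;
            int r;
            while (n < buf.length && (r = in.read(buf, n, buf.length - n)) != -1) {
                n += r;
            }
            return n == buf.length && "imok".equals(new String(buf, StandardCharsets.US_ASCII));
        } catch (Exception e) {
            return false;                                  // unreachable, timed out, or unexpected reply
        }
    }

    public static void main(String[] args) {
        System.out.println(ruok("127.0.0.1", 2181) ? "imok" : "not ok");
    }
}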

Deep Health Checks
• Added deep health check (sketch below):
  • write to one ZooKeeper key
  • read from ZooKeeper key
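A rough sketch of such a deep check using the plain ZooKeeper Java client is below. The znode path and the create-if-missing behaviour are assumptions for illustration, not the actual PagerDuty check.

// Deep health check sketch: prove we can actually write to and read back from
// ZooKeeper, not just that the process answers "ruok". The path is a placeholder.
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class DeepHealthCheck {
    public static boolean check(ZooKeeper zk, String path) {
        try {
            byte[] payload = Long.toString(System.currentTimeMillis())
                                 .getBytes(StandardCharsets.US_ASCII);
            if (zk.exists(path, false) == null) {
                zk.create(path, payload, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } else {
                zk.setData(path, payload, -1);                 // write to one ZooKeeper key
            }
            byte[] readBack = zk.getData(path, false, null);   // read the same key back
            return Arrays.equals(payload, readBack);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (KeeperException e) {
            return false;
        }
    }
}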

The Stack Trace

"LearnerHandler-/123.45.67.89:45874" prio=10 tid=0x00000000024bb800 nid=0x3d0d runnable [0x00007fe6c3193000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
        at org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:118)
        …
        at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123)
        at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1115)
        - locked <0x00000000d4cd9e28> (a org.apache.zookeeper.server.DataNode)
        at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1130)
        at org.apache.zookeeper.server.ZKDatabase.serializeSnapshot(ZKDatabase.java:467)
        at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:493)

Threads (Leader)

[Diagram: the leader's threads: client requests flow into the request processors, and there is one learner handler thread per follower.]

Write Snapshot Code (simplified)

void serializeNode(OutputArchive output, String path) {
    DataNode node = getNode(path);
    String[] children = {};
    synchronized (node) {
        output.writeString(path, "path");
        output.writeRecord(node, "node");   // blocking network write, performed while holding the node lock
        children = node.getChildren();
    }
    for (String child : children) {
        serializeNode(output, path + "/" + child);
    }
}
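The trap in the code above is that the blocking socket write happens while the DataNode monitor is held. One way out, sketched below against the same simplified API, is to copy the node's state inside the lock and do the network writes outside it; copyData() is a hypothetical helper, and this illustrates the principle rather than the fix ZooKeeper actually shipped (see ZOOKEEPER-2201).

void serializeNode(OutputArchive output, String path) {
    DataNode node = getNode(path);
    byte[] data;
    String[] children;
    synchronized (node) {                    // hold the lock only while copying in-memory state
        data = node.copyData();              // hypothetical helper that copies the node's bytes
        children = node.getChildren();
    }
    output.writeString(path, "path");        // blocking network writes now happen outside the lock
    output.writeBuffer(data, "data");
    for (String child : children) {
        serializeNode(output, path + "/" + child);
    }
}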

ZooKeeper Heartbeat
• Why didn’t a follower take over?
• ZK heartbeat: message from leader to follower, follower times out

Threads (Leader)

[Diagram: the same leader thread diagram (client requests, request processors, one learner handler per follower), now with the Quorum Peer thread added.]


TCP

TCP Data Transmission

[Diagram: Follower and Leader, both ESTABLISHED after the usual SYN, SYN-ACK, ACK handshake. Packet 1 is sent and normally acknowledged. When the ACK never arrives, the sender retransmits Packet 1 after ~200 ms, again after ~200 ms, then ~400 ms, ~800 ms, doubling each time up to ~120 sec between attempts; only after 15 retries does the connection finally go to CLOSED.]

TCP Retransmission (Linux Defaults)
• Retransmission timeout (RTO) is based on latency
• TCP_RTO_MIN = 200 ms
• TCP_RTO_MAX = 2 minutes
• /proc/sys/net/ipv4/tcp_retries2 = 15 retries
• 0.2 + 0.2 + 0.4 + 0.8 + … + 120 = 924.8 seconds (15.5 mins) (see the sketch below)
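A quick way to sanity-check that figure: start at TCP_RTO_MIN, double on every retry, cap at TCP_RTO_MAX, and sum over 15 retries. The sketch below does exactly that and lands on roughly the same total; the exact number depends on how the first timeout is counted and on the RTT the kernel has actually measured.

// Back-of-the-envelope version of the arithmetic on this slide. Real kernels derive
// RTO from measured RTT, so this is only the worst-case shape, not an exact value.
public class RtoBackoff {
    public static void main(String[] args) {
        double rto = 0.2;        // seconds; TCP_RTO_MIN
        double total = rto;      // wait before the first retransmission
        for (int retry = 1; retry <= 15; retry++) {      // tcp_retries2 = 15
            rto = Math.min(rto * 2, 120.0);              // exponential backoff, capped at TCP_RTO_MAX
            total += rto;
        }
        // Prints roughly 924.6 seconds, i.e. about 15.4 minutes.
        System.out.printf("~%.1f seconds (~%.1f minutes)%n", total, total / 60.0);
    }
}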

Timeline
1. Network trouble begins - packet loss / latency
2. Follower falls behind, restarts, requests snapshot
3. Leader begins to send snapshot
4. Snapshot transfer stalls
5. Follower ZooKeeper restarts, attempts to close connection
6. Network heals
7. … Leader still stuck


TCP Close Connection

[Diagram: normal close. The follower (ESTABLISHED) sends FIN and enters FIN_WAIT1; the leader (ESTABLISHED) replies FIN/ACK and enters LAST_ACK; the follower ACKs, waits in TIME_WAIT for 60 seconds, and both sides end up CLOSED.]

[Diagram: close while the leader is unreachable. The follower sends FIN from FIN_WAIT1 and keeps retransmitting it; after 8 retries (~1m40s) the follower gives up and its side goes to CLOSED.]

[Diagram: meanwhile the leader is still retransmitting Packet 1 of the snapshot. The follower's side is CLOSED after ~1m40s, but the leader's side stays stuck for ~15.5 mins.]

[Diagram: what should happen once the network heals: the retransmitted Packet 1 reaches the follower, whose side is already CLOSED, so the follower answers with RST and the leader closes immediately.]

syslog - Dropped Packets on Follower

06:51:47 iptables: WARN: IN=eth0 OUT= MAC=00:0d:12:34:56:78:12:34:56:78:12:34:56:78 SRC=<leader_ip> DST=<follower_ip> LEN=54 TOS=0x00 PREC=0x00 TTL=44 ID=36370 DF PROTO=TCP SPT=3888 DPT=36416 WINDOW=227 RES=0x00 ACK PSH URGP=0

iptables

iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
... more rules to accept connections ...
iptables -A INPUT -j DROP

But: iptables connections != netstat connections

conntrack Timeouts
• From linux/net/netfilter/nf_conntrack_proto_tcp.c:
  [TCP_CONNTRACK_LAST_ACK] = 30 SECS

TCP Close Connection

[Diagram: the follower's kernel TCP side sits in FIN_WAIT1 and keeps retransmitting the FIN at exponentially growing intervals (eventually ~12.8s, ~25.6s, ~51.2s apart), while the follower's conntrack entry is in LAST_ACK with a 30-second timeout that is refreshed by each FIN. Once the gap between FINs exceeds 30 seconds, the conntrack entry expires (around ~81.2s in); the kernel's side only reaches CLOSED at ~102.4s, so conntrack has forgotten the connection before the kernel is done closing it.]

The Full Story
• Packet loss
• Follower falls behind, requests snapshot
• (Packet loss continues) follower closes connection
• Follower conntrack forgets connection
• Leader now stuck for ~15 mins, even if network heals

Reproducing (1/3) - Setup
• Follower falls behind:
  tc qdisc add dev eth0 root netem delay 500ms 100ms loss 35%
• Wait for a few minutes
  (Alternate: kill the follower)

Reproducing (2/3) - Request Snapshot
• Remove latency / packet loss:
  tc qdisc del dev eth0 root netem
• Restrict bandwidth:
  tc qdisc add dev eth0 handle 1: root htb default 11
  tc class add dev eth0 parent 1: classid 1:1 htb rate 100kbps
  tc class add dev eth0 parent 1:1 classid 1:11 htb rate 100kbps
• Restart follower ZooKeeper process

Reproducing (3/3) - Carefully Close Connection
• Block traffic to leader:
  iptables -A OUTPUT -p tcp -d <leader ip> -j DROP
• Remove bandwidth restriction:
  tc qdisc del dev eth0 root
• Kill follower ZooKeeper process, kernel tries to close connection
• Monitor conntrack status, wait for entry to disappear, ~80 seconds:
  conntrack -L | grep <leader ip>
• Allow traffic to leader:
  iptables -D OUTPUT -p tcp -d <leader ip> -j DROP

IPsec

[Diagram: TCP data between Follower and Leader is wrapped by IPsec and carried as ESP (UDP) packets across the WAN.]

IPsec - Establish Connection

[Diagram: IPsec Phase 1 and Phase 2 negotiate the tunnel between Follower and Leader, then TCP data flows through it.]

IPsec - Dropped Packets

[Diagram: after Phase 1 and Phase 2, TCP data packets are sent but dropped.]

IPsec - Heartbeat

[Diagram: as above, but with an IPsec heartbeat exchanged between Follower and Leader in addition to the TCP data.]


Lessons

Lesson 1
• Don’t lock and block
• TCP can block for a really long time
• Interfaces / abstract methods make analysis harder

Lesson 2
• Automate debug info collection (stack trace, heap dump, txn logs, etc) (sketch below)
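As one hypothetical example of automating this inside a JVM service: the sketch below writes a full thread dump (roughly what jstack would show) to a timestamped file, so a failing health check can capture the evidence before anyone has to log in.

// Sketch: capture a thread dump to a timestamped file, e.g. triggered by a failing
// health check, so the evidence is saved automatically.
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DebugInfoCollector {
    public static Path dumpThreads(Path dir) throws IOException {
        StringBuilder sb = new StringBuilder();
        ThreadInfo[] threads = ManagementFactory.getThreadMXBean().dumpAllThreads(true, true);
        for (ThreadInfo t : threads) {
            sb.append('"').append(t.getThreadName()).append("\" ").append(t.getThreadState()).append('\n');
            for (StackTraceElement frame : t.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        Path out = dir.resolve("threads-" + System.currentTimeMillis() + ".txt");
        Files.write(out, sb.toString().getBytes(StandardCharsets.UTF_8));
        return out;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("wrote " + dumpThreads(Paths.get("/tmp")));
    }
}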

Lesson 3
• Application/dependency checks should be deep health checks!
• Leader/follower heartbeats should be deep health checks!

Questions?
[email protected]

Link: “Network issues can cause cluster to hang due to near-deadlock”
https://issues.apache.org/jira/browse/ZOOKEEPER-2201

“Mess With The Network” Cheat Sheet

tc qdisc add dev eth0 root netem delay 500ms 100ms loss 25%   # add latency
tc qdisc del dev eth0 root netem                              # remove latency

tc qdisc add dev eth0 handle 1: root htb default 11           # restrict bandwidth
tc class add dev eth0 parent 1: classid 1:1 htb rate 100kbps
tc class add dev eth0 parent 1:1 classid 1:11 htb rate 100kbps
tc qdisc del dev eth0 root                                    # remove bandwidth restriction
# tip: when doing latency / loss / bandwidth restriction, it can be helpful to do
# "sleep 60 && <tc delete command> & disown" in case you lose ssh access

tcpdump -n "src host 123.45.67.89 or dst host 123.45.67.89" -i eth0 -s 65535 -w /tmp/packet.dump
# capture packets, then open locally in wireshark

iptables -A OUTPUT -p tcp --dport 4444 -j DROP                # block traffic
iptables -D OUTPUT -p tcp --dport 4444 -j DROP                # allow traffic
# can use INPUT / OUTPUT chain for incoming / outgoing traffic,
# also --dport <dest port>, --sport <src port>, -s <source ip>, -d <dest ip>

sshfs [email protected]:/tmp/data /mnt
# configure database/application local data directory to be /mnt, then use tools above against 123.45.67.89
# alternative: nbd (network block device)

netstat -peanut   # network connections, regular kernel view
conntrack -L      # network connections, iptables view