Debugging Distributed Systems - Velocity Santa Clara 2016

Transcript of Debugging Distributed Systems - Velocity Santa Clara 2016
2016-06-23

Donny Nadolny
[email protected]
#VelocityConf

What is ZooKeeper
• Distributed system for building distributed systems
• Small in-memory filesystem

ZooKeeper at PagerDuty
• Distributed locking (sketch below)
• Consistent, highly available
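For readers unfamiliar with the pattern, here is a minimal sketch of what ZooKeeper-based distributed locking can look like. It is not PagerDuty's implementation: the lock path is a made-up example, it uses a single ephemeral znode (so every waiter wakes up on release), and a real deployment would normally use a recipe library such as Apache Curator.

// Minimal sketch of ZooKeeper-based locking, NOT PagerDuty's implementation.
// Single ephemeral znode, so every waiter wakes on release (herd effect).
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleZkLock {
    private final ZooKeeper zk;
    private final String path;  // e.g. "/locks/my-lock" (hypothetical path)

    public SimpleZkLock(ZooKeeper zk, String path) {
        this.zk = zk;
        this.path = path;
    }

    public void acquire() throws KeeperException, InterruptedException {
        while (true) {
            try {
                // Ephemeral: the lock is released automatically if our session dies.
                zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return;  // we hold the lock
            } catch (KeeperException.NodeExistsException e) {
                // Someone else holds it; wait until their znode disappears, then retry.
                CountDownLatch released = new CountDownLatch(1);
                if (zk.exists(path, event -> released.countDown()) != null) {
                    released.await();
                }
            }
        }
    }

    public void release() throws KeeperException, InterruptedException {
        zk.delete(path, -1);
    }
}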

ZooKeeper at PagerDuty… Over A WAN
• Distributed locking
• Consistent, highly available

[Diagram: three ZooKeeper nodes (ZK 1, ZK 2, ZK 3) in three data centers (DC-A, DC-B, DC-C), connected by WAN links of roughly 24 ms, 24 ms, and 3 ms.]

The Failure
• Network trouble, one follower falls behind
• ZooKeeper gets stuck - leader still up

[Chart: DB size over time during the incident.]


First Hint
• Leader logs: “Too busy to snap, skipping”

Fault Injection
• Disk slow? let’s test:
  sshfs donny@some_server:/home/donny /mnt
• Similar failure profile
• Re-examine disk latency… nope, was a red herring

Health Checks
• First warning: application monitoring
• Monitoring ZooKeeper: used ruok (sketch below)
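ruok is one of ZooKeeper's "four letter word" commands: send the literal string ruok to the client port and a live server answers imok. A minimal sketch of such a probe is below (host, port, and timeouts are illustrative); the catch, which the next slide addresses, is that this only proves the server process is responsive.

// Shallow ZooKeeper health check: send the four-letter word "ruok", expect "imok".
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class RuokCheck {
    public static boolean ruok(String host, int port) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 2000);
            socket.setSoTimeout(2000);
            OutputStream out = socket.getOutputStream();
            out.write("ruok".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            socket.shutdownOutput();                       // nothing more to send
            InputStream in = socket.getInputStream();
            byte[] buf = new byte[4];
            int n = 0;
            int r;
            while (n < buf.length && (r = in.read(buf, n, buf.length - n)) != -1) {
                n += r;
            }
            return n == buf.length && "imok".equals(new String(buf, StandardCharsets.US_ASCII));
        } catch (Exception e) {
            return false;                                  // unreachable, timed out, or unexpected reply
        }
    }

    public static void main(String[] args) {
        System.out.println(ruok("127.0.0.1", 2181) ? "imok" : "not ok");
    }
}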

Deep Health Checks
• Added deep health check (sketch below):
  • write to one ZooKeeper key
  • read from ZooKeeper key
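A rough sketch of such a deep check using the plain ZooKeeper Java client is below. The znode path and the create-if-missing behaviour are assumptions for illustration, not the actual PagerDuty check.

// Deep health check sketch: prove we can actually write to and read back from
// ZooKeeper, not just that the process answers "ruok". The path is a placeholder.
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class DeepHealthCheck {
    public static boolean check(ZooKeeper zk, String path) {
        try {
            byte[] payload = Long.toString(System.currentTimeMillis())
                                 .getBytes(StandardCharsets.US_ASCII);
            if (zk.exists(path, false) == null) {
                zk.create(path, payload, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } else {
                zk.setData(path, payload, -1);                 // write to one ZooKeeper key
            }
            byte[] readBack = zk.getData(path, false, null);   // read the same key back
            return Arrays.equals(payload, readBack);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (KeeperException e) {
            return false;
        }
    }
}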

The Stack Trace

"LearnerHandler-/123.45.67.89:45874" prio=10 tid=0x00000000024bb800 nid=0x3d0d runnable [0x00007fe6c3193000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
        at org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:118)
        …
        at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123)
        at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1115)
        - locked <0x00000000d4cd9e28> (a org.apache.zookeeper.server.DataNode)
        at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1130)
        at org.apache.zookeeper.server.ZKDatabase.serializeSnapshot(ZKDatabase.java:467)
        at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:493)

Threads (Leader)

[Diagram: the leader's threads: client requests flow into the request processors, and there is one learner handler thread per follower.]

Write Snapshot Code (simplified)

void serializeNode(OutputArchive output, String path) {
    DataNode node = getNode(path);
    String[] children = {};
    synchronized (node) {
        output.writeString(path, "path");
        output.writeRecord(node, "node");   // blocking network write, performed while holding the node lock
        children = node.getChildren();
    }
    for (String child : children) {
        serializeNode(output, path + "/" + child);
    }
}
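The trap in the code above is that the blocking socket write happens while the DataNode monitor is held. One way out, sketched below against the same simplified API, is to copy the node's state inside the lock and do the network writes outside it; copyData() is a hypothetical helper, and this illustrates the principle rather than the fix ZooKeeper actually shipped (see ZOOKEEPER-2201).

void serializeNode(OutputArchive output, String path) {
    DataNode node = getNode(path);
    byte[] data;
    String[] children;
    synchronized (node) {                    // hold the lock only while copying in-memory state
        data = node.copyData();              // hypothetical helper that copies the node's bytes
        children = node.getChildren();
    }
    output.writeString(path, "path");        // blocking network writes now happen outside the lock
    output.writeBuffer(data, "data");
    for (String child : children) {
        serializeNode(output, path + "/" + child);
    }
}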

ZooKeeper Heartbeat
• Why didn’t a follower take over?
• ZK heartbeat: message from leader to follower, follower times out

Threads (Leader)

[Diagram: the same leader thread diagram (client requests, request processors, one learner handler per follower), now with the Quorum Peer thread added.]


TCP

TCP Data Transmission

[Diagram: Follower and Leader, both ESTABLISHED after the usual SYN, SYN-ACK, ACK handshake. Packet 1 is sent and normally acknowledged. When the ACK never arrives, the sender retransmits Packet 1 after ~200 ms, again after ~200 ms, then ~400 ms, ~800 ms, doubling each time up to ~120 sec between attempts; only after 15 retries does the connection finally go to CLOSED.]

TCP Retransmission (Linux Defaults)
• Retransmission timeout (RTO) is based on latency
• TCP_RTO_MIN = 200 ms
• TCP_RTO_MAX = 2 minutes
• /proc/sys/net/ipv4/tcp_retries2 = 15 retries
• 0.2 + 0.2 + 0.4 + 0.8 + … + 120 = 924.8 seconds (15.5 mins) (see the sketch below)
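A quick way to sanity-check that figure: start at TCP_RTO_MIN, double on every retry, cap at TCP_RTO_MAX, and sum over 15 retries. The sketch below does exactly that and lands on roughly the same total; the exact number depends on how the first timeout is counted and on the RTT the kernel has actually measured.

// Back-of-the-envelope version of the arithmetic on this slide. Real kernels derive
// RTO from measured RTT, so this is only the worst-case shape, not an exact value.
public class RtoBackoff {
    public static void main(String[] args) {
        double rto = 0.2;        // seconds; TCP_RTO_MIN
        double total = rto;      // wait before the first retransmission
        for (int retry = 1; retry <= 15; retry++) {      // tcp_retries2 = 15
            rto = Math.min(rto * 2, 120.0);              // exponential backoff, capped at TCP_RTO_MAX
            total += rto;
        }
        // Prints roughly 924.6 seconds, i.e. about 15.4 minutes.
        System.out.printf("~%.1f seconds (~%.1f minutes)%n", total, total / 60.0);
    }
}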

Timeline
1. Network trouble begins - packet loss / latency
2. Follower falls behind, restarts, requests snapshot
3. Leader begins to send snapshot
4. Snapshot transfer stalls
5. Follower ZooKeeper restarts, attempts to close connection
6. Network heals
7. … Leader still stuck


TCP Close Connection

[Diagram: normal close. The follower (ESTABLISHED) sends FIN and enters FIN_WAIT1; the leader (ESTABLISHED) replies FIN/ACK and enters LAST_ACK; the follower ACKs, waits in TIME_WAIT for 60 seconds, and both sides end up CLOSED.]

[Diagram: close while the leader is unreachable. The follower sends FIN from FIN_WAIT1 and keeps retransmitting it; after 8 retries (~1m40s) the follower gives up and its side goes to CLOSED.]

[Diagram: meanwhile the leader is still retransmitting Packet 1 of the snapshot. The follower's side is CLOSED after ~1m40s, but the leader's side stays stuck for ~15.5 mins.]

[Diagram: what should happen once the network heals: the retransmitted Packet 1 reaches the follower, whose side is already CLOSED, so the follower answers with RST and the leader closes immediately.]

syslog - Dropped Packets on Follower

06:51:47 iptables: WARN: IN=eth0 OUT= MAC=00:0d:12:34:56:78:12:34:56:78:12:34:56:78 SRC=<leader_ip> DST=<follower_ip> LEN=54 TOS=0x00 PREC=0x00 TTL=44 ID=36370 DF PROTO=TCP SPT=3888 DPT=36416 WINDOW=227 RES=0x00 ACK PSH URGP=0

iptables

iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
... more rules to accept connections ...
iptables -A INPUT -j DROP

But: iptables connections != netstat connections

conntrack Timeouts
• From linux/net/netfilter/nf_conntrack_proto_tcp.c:
  [TCP_CONNTRACK_LAST_ACK] = 30 SECS

TCP Close Connection

[Diagram: the follower's kernel TCP side sits in FIN_WAIT1 and keeps retransmitting the FIN at exponentially growing intervals (eventually ~12.8s, ~25.6s, ~51.2s apart), while the follower's conntrack entry is in LAST_ACK with a 30-second timeout that is refreshed by each FIN. Once the gap between FINs exceeds 30 seconds, the conntrack entry expires (around ~81.2s in); the kernel's side only reaches CLOSED at ~102.4s, so conntrack has forgotten the connection before the kernel is done closing it.]

The Full Story
• Packet loss
• Follower falls behind, requests snapshot
• (Packet loss continues) follower closes connection
• Follower conntrack forgets connection
• Leader now stuck for ~15 mins, even if network heals

Reproducing (1/3) - Setup
• Follower falls behind:
  tc qdisc add dev eth0 root netem delay 500ms 100ms loss 35%
• Wait for a few minutes
  (Alternate: kill the follower)

Reproducing (2/3) - Request Snapshot
• Remove latency / packet loss:
  tc qdisc del dev eth0 root netem
• Restrict bandwidth:
  tc qdisc add dev eth0 handle 1: root htb default 11
  tc class add dev eth0 parent 1: classid 1:1 htb rate 100kbps
  tc class add dev eth0 parent 1:1 classid 1:11 htb rate 100kbps
• Restart follower ZooKeeper process

Reproducing (3/3) - Carefully Close Connection
• Block traffic to leader:
  iptables -A OUTPUT -p tcp -d <leader ip> -j DROP
• Remove bandwidth restriction:
  tc qdisc del dev eth0 root
• Kill follower ZooKeeper process, kernel tries to close connection
• Monitor conntrack status, wait for entry to disappear, ~80 seconds:
  conntrack -L | grep <leader ip>
• Allow traffic to leader:
  iptables -D OUTPUT -p tcp -d <leader ip> -j DROP

IPsec

[Diagram: TCP data between Follower and Leader is wrapped by IPsec and carried as ESP (UDP) packets across the WAN.]

IPsec - Establish Connection

[Diagram: IPsec Phase 1 and Phase 2 negotiate the tunnel between Follower and Leader, then TCP data flows through it.]

IPsec - Dropped Packets

[Diagram: after Phase 1 and Phase 2, TCP data packets are sent but dropped.]

IPsec - Heartbeat

[Diagram: as above, but with an IPsec heartbeat exchanged between Follower and Leader in addition to the TCP data.]


Lessons

Lesson 1
• Don’t lock and block
• TCP can block for a really long time
• Interfaces / abstract methods make analysis harder

Lesson 2
• Automate debug info collection (stack trace, heap dump, txn logs, etc) (sketch below)
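As one hypothetical example of automating this inside a JVM service: the sketch below writes a full thread dump (roughly what jstack would show) to a timestamped file, so a failing health check can capture the evidence before anyone has to log in.

// Sketch: capture a thread dump to a timestamped file, e.g. triggered by a failing
// health check, so the evidence is saved automatically.
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DebugInfoCollector {
    public static Path dumpThreads(Path dir) throws IOException {
        StringBuilder sb = new StringBuilder();
        ThreadInfo[] threads = ManagementFactory.getThreadMXBean().dumpAllThreads(true, true);
        for (ThreadInfo t : threads) {
            sb.append('"').append(t.getThreadName()).append("\" ").append(t.getThreadState()).append('\n');
            for (StackTraceElement frame : t.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        Path out = dir.resolve("threads-" + System.currentTimeMillis() + ".txt");
        Files.write(out, sb.toString().getBytes(StandardCharsets.UTF_8));
        return out;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("wrote " + dumpThreads(Paths.get("/tmp")));
    }
}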

Lesson 3
• Application/dependency checks should be deep health checks!
• Leader/follower heartbeats should be deep health checks!

Questions?
[email protected]

Link: “Network issues can cause cluster to hang due to near-deadlock”
https://issues.apache.org/jira/browse/ZOOKEEPER-2201

“Mess With The Network” Cheat Sheet

tc qdisc add dev eth0 root netem delay 500ms 100ms loss 25%   # add latency
tc qdisc del dev eth0 root netem                              # remove latency

tc qdisc add dev eth0 handle 1: root htb default 11           # restrict bandwidth
tc class add dev eth0 parent 1: classid 1:1 htb rate 100kbps
tc class add dev eth0 parent 1:1 classid 1:11 htb rate 100kbps
tc qdisc del dev eth0 root                                    # remove bandwidth restriction
# tip: when doing latency / loss / bandwidth restriction, it can be helpful to do
# "sleep 60 && <tc delete command> & disown" in case you lose ssh access

tcpdump -n "src host 123.45.67.89 or dst host 123.45.67.89" -i eth0 -s 65535 -w /tmp/packet.dump
# capture packets, then open locally in wireshark

iptables -A OUTPUT -p tcp --dport 4444 -j DROP                # block traffic
iptables -D OUTPUT -p tcp --dport 4444 -j DROP                # allow traffic
# can use INPUT / OUTPUT chain for incoming / outgoing traffic,
# also --dport <dest port>, --sport <src port>, -s <source ip>, -d <dest ip>

sshfs [email protected]:/tmp/data /mnt
# configure database/application local data directory to be /mnt, then use tools above against 123.45.67.89
# alternative: nbd (network block device)

netstat -peanut   # network connections, regular kernel view
conntrack -L      # network connections, iptables view