OpenStack Summit Barcelona
RabbitMQ at Scale, Lessons Learned
Matthew Popow, Weiguo Sun, Wei Tie, Scott Pham, Kerry Miles
@mattpopow
October 26, 2016
Environment Details
• Overcloud / Undercloud
• 800+ Nova compute nodes, 700 routers, 1,000+ networks, 10,000+ ports
• Each OpenStack service runs on 3 controller VMs
• Neutron with OVS and the L3 agent
• RabbitMQ cluster
  • 3 x 4 vCPU, 8 GB RAM
  • 3-node active/active cluster
• RHEL 7, RabbitMQ 3.3.5-22, Erlang R16B-03.7
• Icehouse & Juno (OSP)
• No heartbeat or QoS
When things go astray
Symptoms of RabbitMQ Issues
• Nova compute services down/flapping
• Instances failing to boot, waiting for port binding
• Neutron agent timeouts
• RabbitMQ queues growing
Timeout waiting on RPC response - topic: "q-plugin", RPC method: "report_state" info: "<unknown>"
Restarting services can compound the issue
OpenStack Configuration Parameters
Neutron & Nova Client Configuration
• Enlarge the RPC pools
  • rpc_thread_pool_size = 2048
  • rpc_conn_pool_size = 60
• Extend timeouts
  • rpc_response_timeout = 960 (especially for large Neutron stacks)
  • get_active_network_info(), sync_state() (dhcp-agent, l3-agent)
  • performance optimization in Kilo
• Add more workers/consumers
  • rpc_workers = 4 (we run 3 Neutron controllers)
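The RPC settings above can be collected into a configuration fragment. The sketch below is only an illustration: the values come from the slide, but placing them in the [DEFAULT] section and the helper name render_rpc_tuning are assumptions, so check the option locations for your release before using it.

```python
import configparser
import io

# Values from the slide. Putting them under [DEFAULT] is an assumption.
RPC_TUNING = {
    "rpc_thread_pool_size": "2048",
    "rpc_conn_pool_size": "60",
    "rpc_response_timeout": "960",
    "rpc_workers": "4",
}


def render_rpc_tuning(options=None):
    """Render the RPC options as an INI fragment (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser["DEFAULT"] = dict(RPC_TUNING if options is None else options)
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_rpc_tuning())
```

Rendering the fragment from one dictionary keeps the values identical across controllers, which matters when several services share the same timeouts.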
3 neutron rpc_workers, 10k backlog
18 neutron rpc_workers, 10k backlog
Client disconnect & (404) errors
• Even with RPC tuning, frequent disconnects / reconnects
• On reconnect, clients saw 404 errors
  • An agent restart was needed
  • OVS flows reloaded (pre-Liberty)
(404) NOT_FOUND - no queue 'q-agent-notifier-network-delete_fanout_a4db343065984f74971fe0080013744e' in vhost '/'
Kombu Driver (Icehouse) / Oslo
• Race condition with auto-delete queues
  • Before Juno, auto-delete was not configurable for Neutron
  • When a reconnect occurs, queue declaration races with auto-delete
• Backport Oslo driver
• Kombu driver improvements for Neutron
neutron/openstack/common/rpc/impl_kombu.py
https://bugs.launchpad.net/neutron/+bug/1393391 Ubuntu Cloud Archive Patch
Connection issues?
RabbitMQ Erlang Configuration
RABBITMQ_SERVER_ERL_ARGS="+K true +A128 +P 1048576 -kernel inet_default_connect_options [{nodelay,true},{raw,6,18,<<5000:64/native>>}] -kernel inet_default_listen_options [{raw,6,18,<<5000:64/native>>}]"
• +K true
  • Enables kernel poll in the Erlang VM (this flag does not set TCP keepalive)
• +A 128
  • Sets the Erlang VM I/O thread pool size
• {raw,6,18,<<5000:64/native>>}
  • Sets TCP_USER_TIMEOUT to 5 seconds, so that a failed established connection is detected quickly
  • Common config recommendation
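The raw tuple reads as (protocol level, option number, value). The sketch below decodes it in Python, assuming Linux, where option 18 at level 6 (IPPROTO_TCP) is TCP_USER_TIMEOUT. The helper names are hypothetical and exist only for illustration.

```python
import socket
import struct

RAW_LEVEL = 6    # the "6" in {raw,6,18,...}; equals socket.IPPROTO_TCP
RAW_OPTION = 18  # the "18"; TCP_USER_TIMEOUT on Linux (assumption: Linux host)


def encode_raw_value(timeout_ms):
    """Pack the timeout like <<5000:64/native>>: 64 bits, native byte order."""
    return struct.pack("=Q", timeout_ms)


def decode_raw_option(level, option, value):
    """Return (level name, option name, timeout in ms) for a raw tuple."""
    is_tcp = level == socket.IPPROTO_TCP
    level_name = "IPPROTO_TCP" if is_tcp else str(level)
    option_name = "TCP_USER_TIMEOUT" if is_tcp and option == RAW_OPTION else str(option)
    (timeout_ms,) = struct.unpack("=Q", value)
    return level_name, option_name, timeout_ms


def apply_user_timeout(sock, timeout_ms):
    """Set the same option on a Python socket (Linux only)."""
    sock.setsockopt(RAW_LEVEL, RAW_OPTION, timeout_ms)


if __name__ == "__main__":
    print(decode_raw_option(6, 18, encode_raw_value(5000)))
```

The value is given in milliseconds, so 5000 corresponds to the 5 seconds mentioned on the slide.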
TCP_USER_TIMEOUT notes
• Note: setting TCP_USER_TIMEOUT overrides the tcp_keepalive timers if it is shorter.
• Dropping a single TCP keepalive packet could trigger a socket teardown.
  • Can happen between RabbitMQ and a client, or between cluster nodes
=ERROR REPORT==== 22-May-2016::08:36:42 ===
closing AMQP connection <0.24752.1> (10.203.106.41:35234 -> 10.203.108.11:5672):
{inet_error,etimedout}
RPC QoS added in Liberty: https://bugs.launchpad.net/oslo.messaging/+bug/1531222
Virtualizing Control Plane
• Default KVM txqueuelen is tiny, 500 packets
# ifconfig tap79920654-fa
tap79920654-fa: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether fe:16:3e:71:7c:d3  txqueuelen 10000  (Ethernet)
        RX packets 8360296  bytes 2076339428 (1.9 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 5544232  bytes 793456462 (756.6 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
# Add udev rule to set tap interface TX queue length
KERNEL=="tap*", RUN+="/sbin/ip link set %k txqueuelen 10000"
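To confirm that the udev rule took effect on a hypervisor, the queue length of every tap interface can be read back from sysfs. This is a minimal sketch, assuming a Linux host that exposes /sys/class/net/<interface>/tx_queue_len. The function names and the 10000 threshold default are illustrative choices, not part of the talk.

```python
from pathlib import Path


def read_tx_queue_lengths(sysfs_net="/sys/class/net", prefix="tap"):
    """Return {interface: tx_queue_len} for interfaces matching the prefix."""
    lengths = {}
    root = Path(sysfs_net)
    if not root.is_dir():
        return lengths
    for iface in sorted(root.iterdir()):
        if not iface.name.startswith(prefix):
            continue
        try:
            lengths[iface.name] = int((iface / "tx_queue_len").read_text().strip())
        except (OSError, ValueError):
            # The interface disappeared or the file was unreadable; skip it.
            continue
    return lengths


def undersized_taps(lengths, minimum=10000):
    """List interfaces whose queue length is below the expected minimum."""
    return sorted(name for name, qlen in lengths.items() if qlen < minimum)


if __name__ == "__main__":
    for name in undersized_taps(read_tx_queue_lengths()):
        print(f"{name}: txqueuelen below 10000")
```

An interface created before the rule was installed keeps its old queue length, so a check like this is mostly useful right after rolling the rule out.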
Virtualizing Control Plane Cont.
• Never suspend the RabbitMQ VM
• Monitor the hypervisor for issues
  • CPU soft lockup (can trigger partition)
  • Disk / memory contention
  • RAID / IO controller resets
RabbitMQ Configuration
[
  {rabbit, [
    {cluster_nodes, {['rabbit@rabbitmq-001', 'rabbit@rabbitmq-002', 'rabbit@rabbitmq-003'], disc}},
    {cluster_partition_handling, pause_minority},
    {vm_memory_high_watermark, 0.4},
    {tcp_listen_options, [binary,
                          {packet,raw},
                          {reuseaddr,true},       % reuse sockets in TIME_WAIT, not safe for NAT
                          {backlog,128},
                          {nodelay,true},         % disable Nagle's algorithm for increased throughput
                          {exit_on_close,false},
                          {keepalive,true}]}      % enable TCP keepalives
  ]}
].
RabbitMQ Process Level Tuning
Limit                     Soft Limit           Hard Limit           Units
Max CPU Time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             unlimited            unlimited            processes
Max open files            65536                65536                files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       unlimited            unlimited            signals
Max msgqueue size         unlimited            unlimited            bytes
Max nice priority         unlimited            unlimited
Max realtime priority     unlimited            unlimited
Max realtime timeout      unlimited            unlimited            us
Verify Process Limits
• e.g. on RHEL/CentOS:
  • cat /proc/$(cat /var/run/rabbitmq/pid)/limits
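The same check can be scripted so that monitoring flags a broker that started with a low file descriptor limit. The sketch below parses the text of /proc/<pid>/limits. The pid file path is the one from the slide, while the function names and the 65536 minimum default are assumptions for illustration.

```python
import re
from pathlib import Path


def parse_limits(text):
    """Parse /proc/<pid>/limits text into {limit name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def open_files_ok(limits, minimum=65536):
    """True when the soft 'Max open files' limit is at least the minimum."""
    soft = limits.get("Max open files", ("0", "0"))[0]
    return soft == "unlimited" or int(soft) >= minimum


def read_rabbitmq_limits(pid_file="/var/run/rabbitmq/pid"):
    """Read the limits of the running broker (path taken from the slide)."""
    pid = Path(pid_file).read_text().strip()
    return parse_limits(Path(f"/proc/{pid}/limits").read_text())


if __name__ == "__main__":
    print("open files limit ok:", open_files_ok(read_rabbitmq_limits()))
```

Checking the live process matters because limits set in a shell or in limits.conf do not always reach a service started by the init system.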
Partition Handling
• Pause minority vs. autoheal
  • CAP theorem: consistency vs. availability
  • pause_minority requires a quorum; a node pauses if it is the only one alive
  • Autoheal and pause_minority are not perfect
• Partition monitoring and alerting
• Automation to restore partition
  • Wipe /var/lib/rabbitmq/mnesia on the problem node, restart RabbitMQ
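The quorum rule behind pause_minority is simple arithmetic. As a rough sketch, assuming the documented behaviour that a node pauses when it can see at most half of the cluster (itself included), the outcomes for a 3-node cluster look like this. The function name is hypothetical.

```python
def is_minority(cluster_size, reachable_nodes):
    """True when a node should pause under pause_minority.

    reachable_nodes counts the node itself plus every peer it can still see.
    A side holding exactly half of the cluster also counts as a minority.
    """
    return 2 * reachable_nodes <= cluster_size


if __name__ == "__main__":
    for reachable in (1, 2, 3):
        state = "pauses" if is_minority(3, reachable) else "keeps running"
        print(f"3-node cluster, node sees {reachable} node(s): {state}")
```

This is why a single surviving node in a 3-node cluster stops serving clients, and why 2-node clusters are a poor fit for this mode.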
Queue Mirroring
• Set by RabbitMQ policy
  • {"ha-mode":"all"}
• rabbit_ha_queues = True  # deprecated
  • Not applicable for RabbitMQ 3.x and later
• Mirroring is not needed for RPC and is expensive
  • Only mirror billing and notification queues
  • Deployment examples in Liberty without queue mirroring
• Policy change likely requires a cluster restart
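A policy that mirrors only selected queues could be generated as below. This is a sketch under assumptions: the policy name and the queue-name pattern are hypothetical and must be adapted to the queue names in your deployment. Only the {"ha-mode": "all"} definition comes from the slide.

```python
import json


def set_policy_command(name, pattern, definition, vhost="/", apply_to="queues"):
    """Build an argument list for `rabbitmqctl set_policy`."""
    return [
        "rabbitmqctl", "set_policy",
        "-p", vhost,
        "--apply-to", apply_to,
        name, pattern, json.dumps(definition),
    ]


if __name__ == "__main__":
    # Hypothetical name and pattern: mirror queues starting with "notifications."
    command = set_policy_command("ha-notifications", r"^notifications\.", {"ha-mode": "all"})
    print(" ".join(command))
```

Building the argument list rather than a shell string avoids quoting problems with the JSON definition when the command is passed to subprocess.run.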
Operating System Tuning
• Default TCP settings are not ideal
  • 2 hours before the first TCP probe is sent
  • Then a probe is resent every 75 seconds
  • After 9 probes without an ACK, the connection is marked as dead
• Adjusting can help with client failover / reconnection
Parameter                        Value
net.ipv4.tcp_keepalive_time      5
net.ipv4.tcp_keepalive_probes    5
net.ipv4.tcp_keepalive_intvl     1
net.ipv4.tcp_retries2            3
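The effect of the keepalive settings is easiest to see as arithmetic: an idle dead connection is detected after the idle time plus one interval per unanswered probe. The sketch below compares the defaults described on the slide with the tuned values. The function name is illustrative.

```python
def keepalive_detection_seconds(keepalive_time, probes, interval):
    """Worst-case seconds before an idle dead peer is declared dead."""
    return keepalive_time + probes * interval


# Defaults described on the slide: 2 hours idle, 9 probes, 75 seconds apart.
DEFAULT_SECONDS = keepalive_detection_seconds(7200, 9, 75)
# Tuned values from the table: 5 seconds idle, 5 probes, 1 second apart.
TUNED_SECONDS = keepalive_detection_seconds(5, 5, 1)

if __name__ == "__main__":
    print(f"default: {DEFAULT_SECONDS} s, tuned: {TUNED_SECONDS} s")
```

With the defaults a dead idle peer can linger for 7875 seconds (over two hours), while the tuned values bring that down to about 10 seconds.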
Monitoring RabbitMQ
• Health check / query each node (rabbitmqadmin)
  • cluster health / partition status
  • Erlang memory utilization vs. high-water mark
  • file descriptors used
  • sockets used
  • process utilization
  • system memory
  • disk utilization
  • queues and number of unacked messages
  • total unacked messages
  • rabbitmq.log for alarm sets:
    "New connections will not be accepted until this alarm clears"
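Several of these checks can be evaluated from one node record. The sketch below assumes the record is a dictionary shaped like an entry returned by the management plugin's node listing, with fields such as partitions, mem_used, mem_limit, fd_used, fd_total, sockets_used and sockets_total. Those field names and all thresholds are assumptions to verify against your RabbitMQ version.

```python
def node_problems(node, mem_ratio=0.9, fd_ratio=0.8, socket_ratio=0.8):
    """Return a list of human-readable problems for one node record."""
    problems = []
    if node.get("partitions"):
        problems.append("partitioned from: " + ", ".join(node["partitions"]))
    if node.get("mem_alarm"):
        problems.append("memory alarm set")
    if node.get("disk_free_alarm"):
        problems.append("disk free alarm set")
    checks = (
        ("memory", "mem_used", "mem_limit", mem_ratio),
        ("file descriptors", "fd_used", "fd_total", fd_ratio),
        ("sockets", "sockets_used", "sockets_total", socket_ratio),
    )
    for label, used_key, total_key, limit in checks:
        used, total = node.get(used_key), node.get(total_key)
        # Skip the check when the record lacks either field.
        if used is not None and total and used / total >= limit:
            problems.append(f"{label} at {used}/{total}")
    return problems


if __name__ == "__main__":
    sample = {"partitions": [], "fd_used": 60000, "fd_total": 65536}
    print(node_problems(sample))
```

Returning a list of problems instead of a single boolean lets an alert say which resource is close to its limit.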
Monitoring RabbitMQ Cont.
• Synthetic tests
  • Boot a VM; create router / network / ping VM
  • Create volume
  • Upload image
• Failures of synthetic transactions can indicate a RabbitMQ issue
Tips
• rabbitmqadmin vs. rabbitmqctl
  • Before 3.6.0, rabbitmqctl list commands did not stream
  • They hang with stuck queues
• Monitor memory management of the stats database
  • rabbitmqctl status
  • rabbitmqctl eval 'exit(erlang:whereis(rabbit_mgmt_db), please_terminate).'
• Disabling RabbitMQ UI
• Adjusting collect_statistics_interval, default 5000 ms
  • rabbitmqctl eval 'application:set_env(rabbit, collect_statistics_interval, 60000).'
Tips Cont.
• Set policy for queue TTL
  • {"expires": "#ms"}
  • > rpc_message_timeout
  • 0 consumers
• Don't use auto-delete queues
• If there are lots of reconnects between client / server, investigate RPC tuning & network stack
• rabbit_hosts=<randomize order>
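Two of these tips reduce to small calculations: choosing an expiry longer than the RPC timeout, and giving each client a differently ordered host list. The sketch below assumes the 960-second rpc_response_timeout from the earlier slide as the timeout to exceed. The 60-second margin, the host names with port 5672, and both function names are hypothetical.

```python
import json
import random


def queue_expiry_policy(rpc_timeout_s, margin_s=60):
    """Policy definition that expires unused queues after the RPC timeout."""
    expires_ms = (rpc_timeout_s + margin_s) * 1000
    return json.dumps({"expires": expires_ms})


def shuffled_rabbit_hosts(hosts, seed=None):
    """Return a rabbit_hosts value with the brokers in random order."""
    shuffled = list(hosts)
    random.Random(seed).shuffle(shuffled)
    return ",".join(shuffled)


if __name__ == "__main__":
    print(queue_expiry_policy(960))
    hosts = ["rabbitmq-001:5672", "rabbitmq-002:5672", "rabbitmq-003:5672"]
    print("rabbit_hosts=" + shuffled_rabbit_hosts(hosts))
```

Randomizing the order per client spreads the initial connections across the cluster instead of sending every client to the first broker in the list.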
Architectural Decisions
• Single cluster vs. many
  • Nova, Neutron, Glance, Cinder
  • Ceilometer, Heat
Resources
• Troubleshooting oslo.messaging / RabbitMQ issues (Austin 2016)
• Troubleshooting RabbitMQ and Its Stability Improvement (Tokyo 2015)
• rabbitmq-users
Q&A