How to Troubleshoot OpenStack Without Losing Sleep


Transcript of How to Troubleshoot OpenStack Without Losing Sleep

Page 1: How to Troubleshoot OpenStack Without Losing Sleep

TRUSTED CLOUD SOLUTIONS

OpenStack Summit Austin

Page 2: How to Troubleshoot OpenStack Without Losing Sleep

WE HAVE OUR RIGHT TO SLEEP

Page 3: How to Troubleshoot OpenStack Without Losing Sleep

Sadique Puthen & Dustin Black, Cloud Success Architects, 26th April 2016

How to Troubleshoot OpenStack Without Losing Sleep

Page 4: How to Troubleshoot OpenStack Without Losing Sleep

[email protected] | @sadiquepp

[email protected] | @dustinlblack

Page 5: How to Troubleshoot OpenStack Without Losing Sleep

Manifestation of a Problem

“Our compute service on the compute node is stuck in a state of activating.”

“Most OpenStack Overcloud neutron services inactive and disabled”

No valid host was found. Exceeded max scheduling attempts 3 for instance

PortLimitExceeded: Maximum number of ports exceeded

“User unable to launch new instances”

Instance failed to spawn

Page 6: How to Troubleshoot OpenStack Without Losing Sleep
Page 7: How to Troubleshoot OpenStack Without Losing Sleep

Over-Working RabbitMQ

Page 8: How to Troubleshoot OpenStack Without Losing Sleep

Over-Working RabbitMQ

Problem Description: Our compute service on the compute node is stuck in a state of activating

The initial evidence is a set of non-descriptive timeouts:

# journalctl --all --this-boot --no-pager | grep nova
May 27 16:20:50 host.example.com systemd[1]: openstack-nova-compute.service operation timed out. Terminating.
May 27 16:20:50 host.example.com systemd[1]: Unit openstack-nova-compute.service entered failed state.
May 27 16:20:50 host.example.com systemd[1]: openstack-nova-compute.service holdoff time over, scheduling restart.

Rebooting the compute node doesn’t help.

Page 9: How to Troubleshoot OpenStack Without Losing Sleep

Over-Working RabbitMQ

Problem Description: Our compute service on the compute node is stuck in a state of activating

An strace of the nova-compute service reveals our trouble communicating with rabbit:

# grep :5672 compute.strace
12938 03:29:28.320069 write(3, "2015-05-28 03:29:28.319 12938 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 192.168.100.47:5672 is unreachable: Socket closed. Trying again in 1 seconds.\n", 169) = 169 <0.000019>
12938 03:29:29.321779 write(3, "2015-05-28 03:29:29.321 12938 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 192.168.100.48:5672\n", 126) = 126 <0.000061>
12938 03:29:30.333894 write(3, "2015-05-28 03:29:30.333 12938 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 192.168.100.48:5672\n", 123) = 123 <0.000013>
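A trace like the one above can be captured by attaching strace to the running service. A minimal sketch (the flags match the timestamped output shown; the pgrep pattern is an assumption about your process names):

# Attach to nova-compute and record timestamped syscalls to a file
# -f follows child processes, -tt prints microsecond timestamps,
# -T prints the time spent in each syscall (the <...> suffix above)
strace -f -tt -T -o compute.strace -p $(pgrep -f nova-compute | head -1)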

Page 10: How to Troubleshoot OpenStack Without Losing Sleep

Over-Working RabbitMQ

The strace leads to more logs... The logs lead to an existing bug report... The bug report leads to an upstream discussion...

Yadda Yadda Yadda

The rabbitmq-server process is out of file descriptors!

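To confirm this kind of descriptor exhaustion directly, a quick sketch (the beam.smp pattern assumes RabbitMQ's usual Erlang VM process name):

# Compare open descriptors against the process limit
PID=$(pgrep -f beam.smp | head -1)
ls /proc/$PID/fd | wc -l                # descriptors currently open
grep 'open files' /proc/$PID/limits     # NOFILE soft/hard limits
rabbitmqctl status | grep -A 3 file_descriptors   # RabbitMQ's own accounting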

Page 11: How to Troubleshoot OpenStack Without Losing Sleep

https://github.com/puppetlabs/puppetlabs-rabbitmq/pull/215#discussion_r24977957

Page 12: How to Troubleshoot OpenStack Without Losing Sleep

Now you know!

Too few RabbitMQ file descriptors is a recipe for sleepless nights.

Set the rabbitmq-server NOFILE limit to 65436*

*Be careful if you're using Pacemaker -- limits are set by the resource agent.
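On a systemd-managed (non-Pacemaker) node, one way to apply that limit is a unit drop-in; a minimal sketch, using the standard systemd drop-in path:

# /etc/systemd/system/rabbitmq-server.service.d/limits.conf
[Service]
LimitNOFILE=65436

Run systemctl daemon-reload and restart rabbitmq-server for the new limit to take effect.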

Page 13: How to Troubleshoot OpenStack Without Losing Sleep

Knowledge-Centered Support

● Continuous improvement of the knowledgebase simplifies troubleshooting of future issues

● Knowledge automatically captured as a by-product of the problem solving process

● Search and reuse as core disciplines of the support team

● Fast track to publication means easier self-resolution

https://access.redhat.com/solutions/1465753

Page 14: How to Troubleshoot OpenStack Without Losing Sleep

WE HAVE OUR RIGHT TO SLEEP

Issue #2: Random failure while spawning a large number of instances

Page 15: How to Troubleshoot OpenStack Without Losing Sleep

$ nova list
ERROR (ConnectionRefused): Unable to establish connection to http://192.168.1.1:35357/v2.0/tokens

● Connections to the various OpenStack service APIs (nova-api, cinder-api, neutron-api, etc.) time out randomly.

● Not reproducible in most environments. When it happens, the failure is random, with no pattern: sometimes 1 in 100 requests, sometimes 1 in 500.

● Keystone itself is up and running perfectly fine.

[Diagram: nova-api, cinder-api, and neutron-api all hitting "connection refused!!" when contacting Keystone]

Issue #2: The symptom is the same as issue #1. Result: random failures when spawning instances, creating volumes, networks, etc.

Page 16: How to Troubleshoot OpenStack Without Losing Sleep

First suspect is Keystone, but it is innocent. Where can one go wrong?

Looking at the error message, it's natural to point fingers at Keystone.

● Looked at the Keystone API logs. No clue!

● Could see an abnormal number of Keystone connections in CLOSE_WAIT status, and wasted a lot of time investigating in that direction (a quick check is sketched after this list).

● It's time to understand how connections flow from the end user to the APIs and on to Keystone, focusing on how the dots are connected.
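A minimal sketch of that CLOSE_WAIT check with ss (5000 and 35357 are Keystone's default ports; adjust for your endpoints):

# List connections to Keystone stuck in CLOSE_WAIT
ss -tn state close-wait '( dport = :5000 or dport = :35357 )'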

Page 17: How to Troubleshoot OpenStack Without Losing Sleep


How does it work under the hood?

[Diagram: the end user's "connection refused!!" arrives at a VIP; behind it, haproxy load-balances nova-api, keystone, and the database across controller-1, controller-2, and controller-3, each running haproxy, nova-api, keystone, and mariadb-galera.]

Page 18: How to Troubleshoot OpenStack Without Losing Sleep

Possibilities?

Keystone is already ruled out.

● Intermittent network packet drop? No, ruled out by network troubleshooting.

● Haproxy (load balancer) drops the connection? Likely. On which hop?

○ end user -> nova: Highly unlikely, as the error occurs when nova connects to keystone.

○ nova -> keystone: Slightly likely.

○ keystone -> database: Highly likely. Enabled logging and found heavy client-termination messages.

haproxy[22346]: 10.243.232.62:48999 [10/Jul/2015:01:41:34.706] galera galera/pcmk-hovsh0800sdc-06 1/0/8734961 37181 cD 1369/1337/1337/1337/0 0/0
haproxy[22346]: 10.243.232.14:53092 [10/Jul/2015:02:37:43.666] galera galera/pcmk-hovsh0800sdc-06 1/0/5400007 2875 cD 1375/1337/1337/1337/0 0/0
haproxy[22346]: 10.243.232.62:41742 [10/Jul/2015:01:47:44.819] galera galera/pcmk-hovsh0800sdc-06 1/0/8400246 38448 cD 1376/1336/1336/1336/0 0/0
haproxy[22346]: 10.243.232.14:53318 [10/Jul/2015:02:37:47.499] galera galera/pcmk-hovsh0800sdc-06 1/0/5400005 3414 cD 1384/1335/1335/1335/0 0/0
haproxy[22346]: 10.243.232.62:42507 [10/Jul/2015:02:37:47.529] galera galera/pcmk-hovsh0800sdc-06 1/0/5400006 2875 cD 1383/1334/1334/1334/0 0/0
haproxy[22346]: 10.243.232.62:42609 [10/Jul/2015:02:37:49.103] galera galera/pcmk-hovsh0800sdc-06 1/0/5400315 35783 cD 1384/1334/1334/1334/0 0/0
haproxy[22346]: 10.243.232.62:42684 [10/Jul/2015:02:37:50.598] galera galera/pcmk-hovsh0800sdc-06 1/0/5400259 28994 cD 1384/1334/1334/1334/0 0/0
haproxy[22346]: 10.243.232.14:53493 [10/Jul/2015:02:37:50.885] galera galera/pcmk-hovsh0800sdc-06 1/0/5400007 2875 cD 1383/1333/1333/1333/0 0/0
haproxy[22346]: 10.243.232.14:53674 [10/Jul/2015:02:37:53.874] galera galera/pcmk-hovsh0800sdc-06 1/0/5400007 3498 cD 1404/1335/1335/1335/0 0/0
haproxy[22346]: 10.243.232.14:54625 [10/Jul/2015:02:38:11.399] galera galera/pcmk-hovsh0800sdc-06 1/0/5400008 12461 cD 1407/1335/1335/1335/0 0/0
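In these lines, the two-letter termination state cD means the client-side timeout expired while the session was still in the DATA phase. HAProxy normally emits these per-session tcplog lines at the info syslog level, so with a defaults section logging at warning (as in the config on the next page) they stay hidden; a sketch of the change that surfaces them:

# defaults section: log at 'info' instead of 'warning' so that
# per-session tcplog lines (with their termination states) are emitted
defaults
    log 127.0.0.1 local2 info
    option tcplog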

Page 19: How to Troubleshoot OpenStack Without Losing Sleep


[Graphic: galera proxy sessions maxed out -- 2000 used of a 2000 limit]

Hold on, but where did I set it? Nowhere!!!

● Then where does this limit come from? It is the default hard-coded limit for each proxy when one is not explicitly defined.

● Then why is there no proper error message? The connection is queued by haproxy to wait for a free database connection, and is then terminated when it hits the timeout.

Haproxy has hit maxconn for galera!

listen galera
    bind 10.243.232.62:3306
    mode tcp
    option tcplog
    option httpchk
    option tcpka
    stick on dst
    stick-table type ip size 2
    timeout client 90m
    timeout server 90m
    server controller-1 10.243.232.14:3306 check inter 1s on-marked-down shutdown-sessions
    server controller-2 10.243.232.15:3306 check inter 1s on-marked-down shutdown-sessions
    server controller-3 10.243.232.16:3306 check inter 1s on-marked-down shutdown-sessions

global
    daemon
    group haproxy
    maxconn 40000
    pidfile /var/run/haproxy.pid
    user haproxy

defaults
    log 127.0.0.1 local2 warning
    mode tcp
    option tcplog
    option redispatch
    retries 3
    timeout connect 5s
    timeout client 30s
    timeout server 30s

maxconn 2000   <-- the implicit hard-coded per-proxy default; it appears nowhere in this config

Page 20: How to Troubleshoot OpenStack Without Losing Sleep


I solved your problem, can I go and sleep? Hold on...

● It took more time to determine the right value for the maximum number of database connections, because it depends on:

○ How many workers are spawned by each API?

■ Depends on the api_workers/workers configuration for each service.

■ Depends on how many CPU cores are on each controller; this can differ from deployment to deployment.

○ Each worker process opens five long-lived database connections.

○ There are also some short-lived connections from each worker.

What should be the maxconn for galera?

Now I can sleep like him.

# Number of workers for OpenStack API service. The default will be the number of CPUs available. (integer value)
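For example, nova's API worker counts can be pinned rather than left to default to the CPU count; a sketch with hypothetical values (osapi_compute_workers and metadata_workers are nova [DEFAULT] options; other services have similar api_workers/workers settings):

# /etc/nova/nova.conf
[DEFAULT]
osapi_compute_workers = 8   # instead of one worker per CPU core
metadata_workers = 8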

Page 21: How to Troubleshoot OpenStack Without Losing Sleep

What should be the maxconn for galera? Based on a default deployment by RHEL OpenStack Platform Director: three controllers, 24 cores each.

Per controller (cores = 24):

nova-api: 24 x 3 = 72
keystone: 24 x 2 = 48
neutron-server: 24 x 2 = 48
glance-api: 24 x 1 = 24
cinder-api: 24 x 1 = 24
glance-registry: 24 x 1 = 24
nova-conductor: 24 x 1 = 24

Per controller total: 264 workers x 5 connections = 1320
Across the haproxy VIP (three controllers): 3 x 1320 = 3960

Add 1024 for:
1 - Short-lived connections
2 - Other services
3 - New services

Total = 4984
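The same arithmetic is easy to script so it can run at deployment time; a minimal sketch using the multipliers above (the per-core worker counts are assumptions matching this slide's defaults):

# Estimate galera maxconn for a 3-controller deployment
CORES=24
CONTROLLERS=3
# nova-api x3, keystone x2, neutron-server x2, plus glance-api,
# cinder-api, glance-registry, nova-conductor at x1 each
WORKERS_PER_CORE=$((3 + 2 + 2 + 1 + 1 + 1 + 1))
PER_CONTROLLER=$((CORES * WORKERS_PER_CORE * 5))   # 5 long-lived DB connections per worker
TOTAL=$((PER_CONTROLLER * CONTROLLERS + 1024))     # headroom for short-lived/other/new
echo "suggested galera maxconn: $TOTAL"            # -> 4984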

Page 22: How to Troubleshoot OpenStack Without Losing Sleep

22

To sleep like a …..?

Setting the right maxconn value for the database proxy upfront can save you from sleepless nights.

● Decide how many worker processes each API actually needs for optimal performance. A 96-core system does not need 3 x 96 nova-api worker processes.

● Automate this calculation and set the result at deployment time itself, both for haproxy (maxconn) and for the database server (max_connections); a sketch of both settings follows this list.

● If you use a different load balancer, make sure to address the same problem there, if applicable.

Decide and set the right value upfront before going to bed.
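Putting it together, a hedged sketch of both settings using the total from the worked example (your numbers will differ):

# haproxy: explicit per-proxy limit for the galera listener
listen galera
    maxconn 4984
    # ...rest of the listener as shown earlier...

# MariaDB (e.g. /etc/my.cnf.d/galera.cnf): allow at least as many server-side sessions
[mysqld]
max_connections = 4984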

Page 23: How to Troubleshoot OpenStack Without Losing Sleep

● Proactive alerts
● Real-time risk assessment
● No infrastructure cost
● Validated resolution
● Tailored resolution
● Quick setup
● SaaS

Discover the Beta: access.redhat.com/insights

Page 24: How to Troubleshoot OpenStack Without Losing Sleep

[email protected] | @sadiquepp

[email protected] | @dustinlblack

Page 25: How to Troubleshoot OpenStack Without Losing Sleep

THANK YOU

plus.google.com/+RedHat

youtube.com/user/RedHatVideos

facebook.com/redhatinc

twitter.com/RedHatNews

linkedin.com/company/red-hat