[rabbitmq-discuss] RabbitMQ hanging during 'rabbitmqctl stop' - please help

Fri Jul 26 17:45:49 BST 2013

On 07/26/2013 05:30 AM, Simon MacMullen wrote:
> On 23/07/2013 3:39PM, Casey Marshall wrote:
>> I was able to automate creation and distribution of certificates among
>> the servers, and the setup works fine in a local KVM-virtualized
>> environment. However, when I started testing it in EC2 across regions, I
>> started having problems. RabbitMQ hangs on some of the servers in EC2
>> during 'service rabbitmq-server restart'.
> 
> We're aware of a bug that could cause deadlocks on shutdown. We've
> thought of it as something that people are quite unlikely to run into
> though, so I'm surprised you're seeing it so frequently.
> 

It's helpful to know that this might be a known deadlock I'm dealing with.

>> I have scripts to automate updating the federated exchanges, to effect
>> topology changes on the "federated group" -- adding or removing a
>> server, or changing its role (different roles have different exchange
>> upstreams). After this script updates the federation config, the
>> subsequent restart hangs on some of the servers.
> 
> However, it doesn't have anything to do with reconfiguring federation.
> 
> Are you restarting the server immediately after reconfiguring
> federation, or some time later?
> 

I'm restarting the server immediately after, on each federated node, in
parallel with Ansible.

I have more information from recent trial-and-error attempts at a
workaround. The shutdown hang happens fairly consistently under the
following conditions:

1. Federated connection URIs are amqps://. With unsecured amqp://, I see
no shutdown hang.
2. There are more than two servers.
3. The servers are in EC2, especially across different geographic
regions. In a three-server demo among local KVM virtual machines, I
haven't seen a shutdown hang.

I suspect the other servers' upstream exchange connection attempts,
made *during* the restart of a server, have something to do with it.

Is there a workaround I might try to avoid the deadlock? For example,
maybe a rabbitmqctl command to close and block new federated connections
while all the servers are reconfiguring? I don't necessarily think it's a
federation issue; maybe it's just the concurrent client connections
getting to the wrong place at the wrong time?
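
In the meantime, one stopgap I could try on my side is to restart the
servers one at a time instead of in parallel, with a timeout on each
restart so a hang is detected rather than blocking forever. Below is a
minimal Python sketch of that idea. The helper names (run_with_timeout,
rolling_restart), the 120-second default, and the use of ssh in the usage
note are my own assumptions for illustration. It does not fix the
deadlock, and I have not confirmed that serializing restarts avoids it.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return True only if it exits with status 0 within timeout_s.

    subprocess.run kills the local child process when the timeout expires.
    If cmd is an ssh invocation, the remote command may keep running.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def rolling_restart(hosts, make_cmd, timeout_s=120):
    """Restart hosts strictly one at a time (hypothetical helper).

    make_cmd(host) must return the argv list that restarts that host.
    Stops at the first failure or timeout so a hung node can be inspected.
    Returns the list of hosts that restarted successfully.
    """
    restarted = []
    for host in hosts:
        if not run_with_timeout(make_cmd(host), timeout_s):
            print("restart failed or timed out on", host, file=sys.stderr)
            break
        restarted.append(host)
    return restarted
```

For example, make_cmd could be something like
lambda h: ["ssh", h, "sudo", "service", "rabbitmq-server", "restart"],
assuming passwordless ssh and sudo on each node. Stopping at the first
failure is deliberate: it leaves the hung node in place for debugging
instead of moving on to the rest of the group.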

> Cheers, Simon
> 

Thanks!
Casey