[rabbitmq-discuss] Node crash, then cluster collapse

Thu Jun 6 11:04:37 BST 2013

Carl

On 5 Jun 2013, at 13:56, carlhoerberg <carl.hoerberg at gmail.com> wrote:
> On a three node cluster, one ec2 machine reboots unexpectedly, and when it
> starts up again RabbitMQ fails to start. I've put all logs here:
> https://gist.github.com/carlhoerberg/ff6c6bd4f7639bf4b2f5
> 

That seems to contain only the logs from one node. What about the others?

> When the troubled node is restarted manually again it's unable to join,
> stopping at "adding mirrors", staying there forever. 
> 
> The other nodes now start to behave weird too, new queues can't be declared,
> but existing queues seems to continue deliver messages. They also can't
> respond to "rabbitmqctl status", or /api/overview. I'm forced to stop them
> with "kill -9". Only when all nodes are stopped the cluster can be brought
> up again normally. 

If you kill -9 the nodes, it's a bit tricky to get live info for diagnosis, assuming there's nothing in the logs. If the logs are available, please post them. Next time this happens, jump on IRC (the #rabbitmq channel on freenode) and we can try a few things to diagnose what's going on. If you can arrange for me to have SSH access to these nodes whilst the symptoms are present, I'll be more likely to solve the issue quickly. We might be able to sign some kind of privacy agreement if necessary.

Also, please post your full setup whenever possible, detailing which plugins you're using (if any) and what kind of HA setup you're using.
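
As a rough sketch of how that setup information could be gathered in one go, something like the following Python script might help. It is only an illustration: the command list is an assumption (`rabbitmqctl list_policies` applies to 3.x brokers, where mirroring is configured through policies), the helper names `run_with_timeout` and `collect` are made up for this example, and the 30 second timeout is arbitrary. The timeout is there because the CLI tools may hang on a node in the state described above.

```python
import subprocess

# Assumed set of commands; adjust for your RabbitMQ version and install paths.
COMMANDS = [
    ["rabbitmqctl", "cluster_status"],
    ["rabbitmqctl", "list_policies"],      # HA policies on 3.x brokers
    ["rabbitmq-plugins", "list", "-e"],    # enabled plugins only
]


def run_with_timeout(cmd, timeout=30):
    """Run cmd and return its combined stdout/stderr.

    Returns a short marker instead if the command hangs or is not installed,
    so one stuck command does not block the rest of the report.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout,
        )
        return result.stdout
    except subprocess.TimeoutExpired:
        return "<timed out after %d s>\n" % timeout
    except FileNotFoundError:
        return "<command not found>\n"


def collect(commands=COMMANDS, timeout=30):
    """Run each command in turn and return one text report."""
    sections = []
    for cmd in commands:
        header = "$ " + " ".join(cmd) + "\n"
        sections.append(header + run_with_timeout(cmd, timeout))
    return "\n".join(sections)


if __name__ == "__main__":
    print(collect())
```

Running it on each node and pasting the output alongside the logs would cover most of what is asked for here. A command reported as timed out is itself a useful data point.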

Cheers,
Tim