[rabbitmq-discuss] RabbitMQ 3.2.2 cluster shutting down

Thu Mar 6 15:19:30 GMT 2014

On 06/03/14 15:04, Andy Martin wrote:
> ** {{badmatch,{error,not_found}},
>      [{rabbit_amqqueue_process,i,2,[]},
>       {rabbit_amqqueue_process,'-infos/2-lc$^0/1-0-',2,[]},
>       {rabbit_amqqueue_process,'-infos/2-lc$^0/1-0-',2,[]},
>       {rabbit_amqqueue_process,emit_stats,2,[]},

That's the exact stack trace of a bug we fixed in 3.2.1. Are you 
completely sure that both nodes are running 3.2.2?
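
One way to check, as a rough sketch: run `rabbitmqctl status` on each
node and compare the version reported for the `rabbit` application. The
Python below only parses status text that has already been captured. The
function names are made up for illustration, and the regular expression
assumes the usual Erlang-term layout of the status output.

```python
import re

# Matches the rabbit entry in the running_applications list printed by
# `rabbitmqctl status`, e.g. {rabbit,"RabbitMQ","3.2.2"}.
# Assumption: the status output uses this Erlang-term layout.
_RABBIT_APP = re.compile(r'\{rabbit,\s*"[^"]*",\s*"([^"]+)"\}')


def rabbit_version(status_output):
    """Return the RabbitMQ version string found in status output, or None."""
    match = _RABBIT_APP.search(status_output)
    return match.group(1) if match else None


def versions_by_node(status_outputs):
    """Map each node name to the version parsed from its status output."""
    return {node: rabbit_version(text) for node, text in status_outputs.items()}


def all_nodes_match(status_outputs, expected):
    """True only if every node reports exactly the expected version."""
    versions = versions_by_node(status_outputs)
    return bool(versions) and all(v == expected for v in versions.values())
```

If `all_nodes_match(outputs, "3.2.2")` is false, `versions_by_node(outputs)`
shows which node is reporting something else.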

> I have a two-node autoheal cluster. Yesterday I came in to find out
> that neither server in the cluster was seeing the other, but there
> was no message about a partition having occurred.

You will only be notified about the partition once the servers can see
each other again. Until that happens, A thinks B is down and B thinks
A is down; neither has any way to know it's a partition.

> Today I came in to
> find one of the servers down. I can't find anything in the logs that
> would indicate why this is happening. What should I be looking for in
> the logs?

"Down" as in the Erlang VM had exited, or "down" as in the Erlang VM was 
still running but the RabbitMQ application had stopped?

If the former, then it can't be anything to do with autoheal since that 
doesn't attempt to stop the node completely. You might want to look for 
"Halting Erlang VM" in the logs; that would imply the broker was shut 
down deliberately. If that's not there then please post the logs 
somewhere with an indication of when you became aware the server was down.

If the latter, it's possible that you ran into one of the bugs we fixed 
in 3.2.4, where an attempt at autoheal can get stuck. There should still 
be messages about autoheal in the logs in that case though.
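
As a rough sketch of that log check, the Python below scans log lines
for "Halting Erlang VM" and for any mention of autoheal. The marker
names and function names are illustrative, and the exact wording of the
autoheal messages is an assumption, so the second pattern just matches
the word itself, ignoring case. The log path varies by installation and
is left to the caller.

```python
import re

# Assumption: these markers follow the advice above; the exact wording of
# autoheal log lines may differ, so the word is matched loosely.
MARKERS = {
    "vm_halt": re.compile(r"Halting Erlang VM"),
    "autoheal": re.compile(r"autoheal", re.IGNORECASE),
}


def scan_log_lines(lines):
    """Return (line_number, marker_name, text) for every matching line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in MARKERS.items():
            if pattern.search(line):
                hits.append((number, name, line.rstrip("\n")))
    return hits


def scan_log_file(path):
    """Scan one log file; undecodable bytes are replaced rather than fatal."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return scan_log_lines(handle)
```

A "vm_halt" hit suggests a deliberate shutdown; "autoheal" hits with the
application stopped would fit the stuck-autoheal case. If neither shows
up, the full logs are what we would need to see.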

Cheers, Simon

-- 
Simon MacMullen
RabbitMQ, Pivotal