[rabbitmq-discuss] Queue disappears during partition/autoheal

Wed Apr 16 13:08:46 BST 2014

On 15/04/14 23:09, Matt Pietrek wrote:
> This is rabbitmq 3.2.4, running in a 2 node cluster with all queues in ha.

> At some point we saw a network partition (see below). It appears that
> Autoheal eventually worked, but afterwards the cmcmd queue wasn't on the
> broker.

> =ERROR REPORT==== 14-Apr-2014::18:02:30 ===
> ** Generic server <0.204.0> terminating
> ** Last message in was {mnesia_locker,rabbit@sea5m1mq1,granted}
> ** When Server state == {state,2,{from,<0.302.0>,#Ref<0.0.1372.163190>}}
> ** Reason for termination ==
> ** {unexpected_info,{mnesia_locker,rabbit@sea5m1mq1,granted}}

We have seen this before with short-lived partitions: after the
partition, something in Mnesia sends a stray {mnesia_locker, ..., ...}
message to a process that isn't expecting it, and that kills the
process in question.
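
To make the failure mode concrete, here is a rough toy model in Python.
It is not Mnesia or RabbitMQ code; the class, state names and node name
are all made up for illustration. It only shows how a process that
treats any unrequested message as fatal can be taken down by a late
"granted" reply:

```python
class UnexpectedInfo(Exception):
    """Raised when a message arrives that the server did not ask for."""


class StrictServer:
    """Hypothetical stand-in for a process that only accepts replies
    while it is actually waiting for one."""

    def __init__(self):
        self.state = "idle"
        self.granted_by = []

    def request_lock(self):
        # Pretend a lock request was sent; a reply is now expected.
        self.state = "waiting"

    def handle_info(self, message):
        tag, node, reply = message
        if self.state == "waiting" and tag == "mnesia_locker" and reply == "granted":
            self.granted_by.append(node)
            self.state = "idle"
            return
        # Anything else is fatal, mirroring the {unexpected_info, ...} exit reason.
        raise UnexpectedInfo(message)


server = StrictServer()
server.request_lock()
server.handle_info(("mnesia_locker", "node-a", "granted"))  # expected reply

try:
    # A stray duplicate arriving after the partition: nothing is pending now.
    server.handle_info(("mnesia_locker", "node-a", "granted"))
    crashed = False
except UnexpectedInfo:
    crashed = True
```

The first reply is accepted because the server is waiting for it; the
second, identical message arrives when nothing is outstanding and
terminates the handler.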

The release notes for Erlang 17.0 contain:

OTP-11497  To prevent a race condition if there is a short communication
            problem when node-down and node-up events are received. They
            are now stored and later checked if the node came up just
            before mnesia flagged the node as down. (Thanks to Jonas
            Falkevik)

which sounds like the same thing.

So it is quite possible that this is fixed in Erlang 17.0.

Cheers, Simon

-- 
Simon MacMullen
RabbitMQ, Pivotal