[rabbitmq-discuss] Rabbitmq server crash.

Tim Watson tim at rabbitmq.com
Fri Aug 31 10:46:16 BST 2012


Hi

On 29 Aug 2012, at 13:12, Pankaj Mishra wrote:

> Hi,
>  
> We experienced a strange problem with the rabbitmq server running in a cluster. According to the
> log file, the master node of the cluster crashed. After that, all my publishers continued to send messages without
> throwing any exception, but all of those messages were dropped silently by the rabbitmq server. Consumers were not able
> to get any of those messages until we restarted the rabbitmq server.
>  
> I have attached to this mail the server crash logs for the master as well as for the slave.

According to the master log, the mnesia database has become inconsistent, which is not a good sign. It looks very much like a network partition has occurred here:

%% from master.log

=INFO REPORT==== 7-Aug-2012::18:42:24 ===
rabbit on node rabbit@MyTimes160 down

%% from slave.log

=INFO REPORT==== 7-Aug-2012::18:42:25 ===
rabbit on node rabbit@MyTimes159 down

According to the logs, both the master and the slave observed the other node disappear, which seems consistent with the network partition theory.

%% from master.log

=INFO REPORT==== 7-Aug-2012::18:42:25 ===
Mirrored-queue (queue 'cms' in vhost '/'): Master <rabbit@MyTimes159.3.444.0> saw deaths of mirrors <rabbit@MyTimes160.2.595.0>

=INFO REPORT==== 7-Aug-2012::18:42:25 ===
Mirrored-queue (queue 'mytimes' in vhost '/'): Master <rabbit@MyTimes159.3.437.0> saw deaths of mirrors <rabbit@MyTimes160.2.591.0>

%% from slave.log

=INFO REPORT==== 7-Aug-2012::18:42:25 ===
Mirrored-queue (queue 'mytimes' in vhost '/'): Slave <rabbit@MyTimes160.2.591.0> saw deaths of mirrors <rabbit@MyTimes159.3.437.0>

=INFO REPORT==== 7-Aug-2012::18:42:25 ===
Mirrored-queue (queue 'cms' in vhost '/'): Slave <rabbit@MyTimes160.2.595.0> saw deaths of mirrors <rabbit@MyTimes159.3.444.0>


Rabbit is not partition tolerant, so I would expect things to go wrong under such circumstances, but I would not expect messages to be silently dropped. My reading of the logs so far is that when the partitioned database state is reached, a message is sent to the gm ring on the 'master' node (the {mnesia_locker,rabbit@MyTimes160,granted} message) which isn't handled, thereby crashing the gm handling process. Once that is down, other things start to go wrong. The parent supervisor will have restarted the failed process to get things back into a consistent state, but it looks as though, because mnesia has its knickers in a twist about the partitioned database, the recovery can't take place properly.
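
Incidentally, when the two islands of a partitioned cluster see one another again, mnesia reports it as an inconsistent_database system event on each node, which is one way of confirming that a net split really did happen. Here is a minimal sketch of listening for that event from a shell attached to a rabbit node - the module name is just for illustration, this is not RabbitMQ code:

%% Sketch only, not part of RabbitMQ: subscribe to mnesia's system events so
%% that a partitioned database announces itself the moment the islands
%% re-establish contact.
-module(mnesia_watch).
-export([start/0]).

start() ->
    {ok, _} = mnesia:subscribe(system),
    loop().

loop() ->
    receive
        %% Context is running_partitioned_network (or
        %% starting_partitioned_network); Node is the node on the other side.
        {mnesia_system_event, {inconsistent_database, Context, Node}} ->
            io:format("mnesia inconsistent: ~p with ~p~n", [Context, Node]),
            loop();
        {mnesia_system_event, Event} ->
            io:format("mnesia system event: ~p~n", [Event]),
            loop()
    end.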

We will look into this asap, but can you confirm that a net split did in fact take place around the time this problem started appearing?
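
In the meantime, publisher confirms would at least let your publishers notice when the broker has not taken responsibility for a message, rather than it being lost silently. A rough sketch using the Erlang client (amqp_client), assuming an already-open channel and binary arguments - other clients expose the same confirm.select mechanism under different names:

%% Sketch only: put the channel into confirm mode so an unconfirmed publish
%% is detected by the publisher instead of disappearing silently.
-module(confirming_publisher).
-include_lib("amqp_client/include/amqp_client.hrl").
-export([publish/4]).

publish(Channel, Exchange, RoutingKey, Payload) ->
    %% Enable confirms on this channel (a no-op if already enabled).
    #'confirm.select_ok'{} = amqp_channel:call(Channel, #'confirm.select'{}),
    ok = amqp_channel:cast(Channel,
                           #'basic.publish'{exchange    = Exchange,
                                            routing_key = RoutingKey},
                           #amqp_msg{payload = Payload}),
    %% Blocks until the broker acks or nacks everything outstanding on the
    %% channel: true means all confirmed, false means something was nacked
    %% and should be republished.
    amqp_channel:wait_for_confirms(Channel).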
 
