[rabbitmq-discuss] Rabbitmq server crash.

Tim Watson tim at rabbitmq.com
Fri Aug 31 15:50:30 BST 2012


Hi

On further investigation, this error condition (verified by the presence of '{mnesia_locker,rabbit@MyTimes160,granted}' in the gm process mailbox) is indeed the result of a netsplit. In RabbitMQ, clustering (and by association, HA/mirrored queues) is not partition tolerant, and therefore netsplits *will* cause errors like this to occur. If you cannot rely on the network links between your clustered nodes, then you should consider another approach to distribution, such as federation.
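
Incidentally, if you want to check for this yourself, the mailbox of a running process can be inspected from an Erlang shell attached to the broker node. A minimal sketch using standard Erlang calls (GmPid is assumed to be the pid of the suspect gm process, obtained however you normally locate it):

%% attach a remote shell to the broker node, e.g.
%%   erl -sname debug -remsh rabbit@MyTimes159
%% then, with GmPid bound to the gm process in question:
{messages, Pending} = erlang:process_info(GmPid, messages),
%% keep only mnesia_locker terms such as {mnesia_locker, 'rabbit@MyTimes160', granted}
lists:filter(fun({mnesia_locker, _, _}) -> true; (_) -> false end, Pending).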

For more details about this, see the distribution guide (http://www.rabbitmq.com/distributed.html) and in particular note these comments from the 'Summary' section:

Federation / Shovel: chooses Availability and Partition Tolerance from the CAP theorem.
Clustering:          chooses Consistency and Availability from the CAP theorem.

So if you want Consistency (guarantees) and Availability, you should go with clustering and HA, but if you want Availability *and* Partition tolerance, then clustering/HA is not the right setup for you. Also, if this is happening once a month, I'd suggest looking at what the network admin team is doing around that time, to see whether some kit (or software) is being changed, reconfigured and/or taken offline for maintenance during that period.

Cheers,
Tim 

On 31 Aug 2012, at 10:46, Tim Watson wrote:

> Hi
> 
> On 29 Aug 2012, at 13:12, Pankaj Mishra wrote:
> 
>> Hi,
>>  
>> We experienced a strange problem with RabbitMQ servers running in a cluster. According to the
>> log file, the master node of the cluster crashed. After that, all my publishers continued to send messages without
>> throwing any exception, but all those messages were dropped silently by the RabbitMQ server. Consumers were not able
>> to get any of those messages until we restarted the RabbitMQ server.
>>  
>> I have attached to this mail the server crash logs for both the master and the slave.
> 
> According to the master log, the mnesia database has become inconsistent, which is not a good sign. It looks very much like a network partition has occurred here:
> 
> %% from master.log
> 
> =INFO REPORT==== 7-Aug-2012::18:42:24 ===
> rabbit on node rabbit@MyTimes160 down
> 
> %% from slave.log
> 
> =INFO REPORT==== 7-Aug-2012::18:42:25 ===
> rabbit on node rabbit@MyTimes159 down
> 
> According to the logs, both the master and the slave observed the other node disappear, which seems consistent with the network partition theory.
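> 
> Incidentally, mnesia itself flags this condition: a process subscribed to its system events receives an inconsistent_database event when a partitioned node comes back into contact. A minimal sketch using the standard mnesia API (the node name below is just illustrative):
> 
> %% run in a shell on one of the broker nodes
> {ok, _ThisNode} = mnesia:subscribe(system),
> %% after a netsplit, the subscriber is sent something like:
> %%   {mnesia_system_event,
> %%    {inconsistent_database, running_partitioned_network, 'rabbit@MyTimes160'}}
> receive
>     {mnesia_system_event, {inconsistent_database, Context, Node}} ->
>         io:format("partitioned database: ~p (~p)~n", [Context, Node])
> after 60000 ->
>     no_partition_event_seen
> end.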
> 
> %% from master.log
> 
> =INFO REPORT==== 7-Aug-2012::18:42:25 ===
> Mirrored-queue (queue 'cms' in vhost '/'): Master <rabbit@MyTimes159.3.444.0> saw deaths of mirrors <rabbit@MyTimes160.2.595.0> 
> 
> =INFO REPORT==== 7-Aug-2012::18:42:25 ===
> Mirrored-queue (queue 'mytimes' in vhost '/'): Master <rabbit@MyTimes159.3.437.0> saw deaths of mirrors <rabbit@MyTimes160.2.591.0> 
> 
> %% from slave.log
> 
> =INFO REPORT==== 7-Aug-2012::18:42:25 ===
> Mirrored-queue (queue 'mytimes' in vhost '/'): Slave <rabbit@MyTimes160.2.591.0> saw deaths of mirrors <rabbit@MyTimes159.3.437.0> 
> 
> =INFO REPORT==== 7-Aug-2012::18:42:25 ===
> Mirrored-queue (queue 'cms' in vhost '/'): Slave <rabbit@MyTimes160.2.595.0> saw deaths of mirrors <rabbit@MyTimes159.3.444.0> 
> 
> 
> Rabbit is not partition tolerant, so I would expect things to go wrong under such circumstances, but I would not expect messages to be silently dropped. My reading of the logs so far is that when the partitioned database state is reached, a message (the {mnesia_locker,rabbit@MyTimes160,granted} term) is sent to the gm ring on the 'master' node but isn't handled, which crashes the gm handling process. Once that is down, other things start to go wrong. The parent supervisor will have restarted the failed process to get things back into a consistent state, but it looks as though, because mnesia has its knickers in a twist about the partitioned database, the recovery can't take place properly.
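> 
> To illustrate the failure mode (this is just a minimal gen_server sketch, not the actual gm code): if a callback module's handle_info/2 has no clause matching a stray message, the callback fails with a function_clause error, the process crashes, and its supervisor restarts it.
> 
> %% stray_msg_demo.erl - illustrative only
> -module(stray_msg_demo).
> -behaviour(gen_server).
> -export([start_link/0, init/1, handle_call/3, handle_cast/2,
>          handle_info/2, terminate/2, code_change/3]).
> 
> start_link() -> gen_server:start_link(?MODULE, [], []).
> 
> init([]) -> {ok, undefined}.
> handle_call(_Req, _From, State) -> {reply, ok, State}.
> handle_cast(_Msg, State) -> {noreply, State}.
> 
> %% only 'ping' is expected; an unexpected term such as
> %% {mnesia_locker, Node, granted} matches no clause, so the
> %% process exits with a function_clause error
> handle_info(ping, State) -> {noreply, State}.
> 
> terminate(_Reason, _State) -> ok.
> code_change(_OldVsn, State, _Extra) -> {ok, State}.
> 
> Sending Pid ! {mnesia_locker, node(), granted} to such a process kills it, and it is then down to the supervisor restart (and whatever state is left lying around) to put things right, which is roughly where the partitioned mnesia state gets in the way here.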
> 
> We will look into this asap, but can you confirm that a net split did in fact take place around the time this problem started appearing?
>  
> 
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss


