[rabbitmq-discuss] Mnesia corrupting after node joining cluster

Wed Aug 1 19:15:42 BST 2012

Hi Matthias,

Thanks for the quick answers.

> Also, are you sure you actually *need* clustering? It does, by necessity, 
> add a significant amount of complexity and possible failure modes, so only 
> use it if you have to.

To implement distributed messaging, my understanding is that I've got three
options: clustering, federation or the shovel. In my case I've got a simple
office network with 16 machines. The apps that run on the machines need to
see all the messages that get published, so I have a simple fanout exchange.
Each app instantiates its own distinct, non-persistent queue. So a cluster
seemed to be the simplest choice, in that there is just a single fanout
exchange and a simple transient queue for each app.
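The fanout arrangement described above can be modelled in a few lines to show why each app needs its own queue: a fanout exchange ignores routing keys and copies every published message into every queue bound to it. This is a toy in-memory sketch of the routing semantics only, not the RabbitMQ client API; the class and method names (`FanoutExchange`, `bind`, `unbind`, `publish`), the queue names and the message strings are all hypothetical.

```python
from collections import deque


class FanoutExchange:
    """Toy in-memory model of a fanout exchange (hypothetical, not a RabbitMQ API)."""

    def __init__(self):
        self.bound_queues = []

    def bind(self, queue):
        self.bound_queues.append(queue)

    def unbind(self, queue):
        # Remove by identity, since two queues holding the same messages compare equal.
        self.bound_queues = [q for q in self.bound_queues if q is not queue]

    def publish(self, message):
        # Fanout semantics: no routing key, every bound queue gets a copy.
        for queue in self.bound_queues:
            queue.append(message)


# One shared exchange; each of the 16 apps gets its own transient queue.
exchange = FanoutExchange()
app_queues = {f"app-{i}": deque() for i in range(16)}
for q in app_queues.values():
    exchange.bind(q)

exchange.publish("tick 1")
exchange.publish("tick 2")

# Every app sees every message, in publish order.
for q in app_queues.values():
    assert list(q) == ["tick 1", "tick 2"]

# When an app goes away, its transient queue goes with it and stops receiving.
gone = app_queues.pop("app-0")
exchange.unbind(gone)
exchange.publish("tick 3")
print(len(app_queues), list(app_queues["app-1"]))  # prints: 15 ['tick 1', 'tick 2', 'tick 3']
```

In a real broker the per-app queue would be declared as non-durable (or exclusive/auto-delete) so that it disappears with its consumer, which is what `unbind` and `pop` stand in for here.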

Is there any non-clustered way to enable distributed messaging in a single
location (other than Federation and the Shovel)?

Thanks much
David

> ----- Original Message ----- 
> From: "Matthias Radestock" <matthias at rabbitmq.com>
> To: "David Brown" <dbrown at prmllc.com>; "Discussions about RabbitMQ" 
> <rabbitmq-discuss at lists.rabbitmq.com>
> Sent: Wednesday, August 01, 2012 12:46 PM
> Subject: Re: [rabbitmq-discuss] Mnesia corrupting after node joining cluster
>
>
>> David,
>>
>> On 01/08/12 18:20, David Brown wrote:
>>> has there been any work on this issue (i.e. errors when doing admin work
>>> on a cluster)?
>>
>> yes, but it will likely be a few months before these changes make it into 
>> a release.
>>
>>> I've got a tiny, two node development cluster.  Removing
>>> the ram node caused the remaining (disc) node to fail on startup with a
>>> mnesia related error when I restarted it.  Eventually, the remaining
>>> (disc) node started up, but it still thinks the other node is clustered
>>> with it.  I've tried everything to try and get this node to realize it
>>> is the only node left in the cluster, nothing works.  FWIW the removed
>>> node does realize it is no longer part of the two node cluster.
>>>
>>> I was very careful in terms of following the exact steps in the
>>> 'Breaking up a cluster' section on the rabbitmq web site.
>>
>> Hmm. It certainly looks like the disk node still thinks it is clustered 
>> with the ram node, and consequently it will fail to merge its schema with 
>> it when starting. That really shouldn't happen if you indeed followed the 
>> documented steps when breaking up the cluster, in particular if the disk 
>> node was up and running when removing the ram node.
>>
>> If you can reproduce the problem then please post a transcript of the 
>> commands.
>>
>>> At this point, I'm a bit concerned about basing our production
>>> systems around rabbitmq (we're a small hedge fund) when it seems to
>>> fail on the simplest of tasks.
>>
>> Clustering is hardly the "simplest of tasks" - sending and receiving 
>> messages is ;)
>>
>> The clustering code hasn't changed much in 4+ years. It is stable but 
>> suffers from little mistakes resulting in situations that are hard to 
>> recover from - that's what Francesco is addressing.
>>
>> I suggest you conduct some more experiments and keep transcripts of 
>> everything you are doing. Then, if you do encounter a weird situation it 
>> will be much easier to reproduce and diagnose the problem.
>>
>> Also, are you sure you actually *need* clustering? It does, by necessity, 
>> add a significant amount of complexity and possible failure modes, so 
>> only use it if you have to.
>>
>> Regards,
>>
>> Matthias.