[rabbitmq-discuss] AWS clustering

Wed Sep 5 04:01:38 BST 2012

Hi all,

In a project I'm working on we set up a cluster of application nodes, each 
with RabbitMQ installed. Everybody can talk to everybody, and we can scale 
the number of nodes pretty easily. But on more than one occasion, we have 
seen mnesia become partitioned. You've seen this before:

Mnesia('rabbit@app-6'): ** ERROR ** mnesia_event got 
{inconsistent_database, running_partitioned_network, 'rabbit@app-5'}

As best we can tell, this is caused by temporary network outages, possibly 
by high-load conditions, or by a combination of both. However it happens, 
you end up with one or more nodes down for the count and behaving 
non-deterministically (messages sent to that node may or may not reach 
other nodes). The node doesn't recover until you *manually* run stop_app 
and start_app. And if it happened to be a disc node, you also have to run 
rm -rf /var/lib/rabbitmq/mnesia/rabbitmq/* in between.
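
To make the manual procedure above concrete, here is a rough Python sketch of 
the two pieces an automated supervisor would need: spotting the partition 
event in a log line, and listing the recovery steps in order. The function 
names (parse_partition_event, recovery_commands) are made up for 
illustration, the sketch assumes the event shows up on a single log line, and 
it only builds the command list without running anything. It is not a tested 
recovery tool, and it certainly doesn't cover all cases.

```python
import re

# Matches the Mnesia partition event quoted above. Accepts node names written
# as "rabbit@host" as well as the "rabbit at host" form some archives show.
PARTITION_RE = re.compile(
    r"Mnesia\('?(?P<local>[^')]+)'?\):.*"
    r"\{inconsistent_database,\s*running_partitioned_network,\s*"
    r"'?(?P<peer>[^'}]+)'?\}"
)


def _normalize(node):
    """Return the node name in the usual name@host form."""
    return node.strip().replace(" at ", "@")


def parse_partition_event(line):
    """Return (reporting_node, peer_node) if the line is a partition event,
    otherwise None. Hypothetical helper, not part of RabbitMQ."""
    match = PARTITION_RE.search(line)
    if match is None:
        return None
    return _normalize(match.group("local")), _normalize(match.group("peer"))


def recovery_commands(is_disc_node,
                      mnesia_dir="/var/lib/rabbitmq/mnesia/rabbitmq"):
    """List the manual recovery steps from the post, in order.

    The default mnesia_dir is the path quoted in the post; adjust it for
    your installation. Nothing is executed here.
    """
    commands = ["rabbitmqctl stop_app"]
    if is_disc_node:
        # Wipes the node's local Mnesia data, as described for disc nodes.
        commands.append("rm -rf %s/*" % mnesia_dir)
    commands.append("rabbitmqctl start_app")
    return commands


if __name__ == "__main__":
    sample = ("Mnesia('rabbit@app-6'): ** ERROR ** mnesia_event got "
              "{inconsistent_database, running_partitioned_network, "
              "'rabbit@app-5'}")
    if parse_partition_event(sample) is not None:
        # Dry run: print the steps instead of executing them.
        for command in recovery_commands(is_disc_node=True):
            print(command)
```

Printing the commands rather than executing them is deliberate: deciding 
which side of the partition to reset is exactly the part that still needs a 
human, or a much smarter supervisor.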

For a supposedly "just works" kind of service, that is just not good 
enough. I can't have my ops people rolling out of bed to take action every 
time there's a minor network glitch. So, I either need to provide a network 
that never becomes partitioned (does such a network exist? Certainly not at 
AWS!), or I need to drop clustering and have a single RabbitMQ server which 
won't scale, or I need to cobble together some kind of automated supervisor 
which is certain not to handle all cases, or I need to use a different 
messaging tool.

Please, somebody dispute my conclusion because I would love to continue 
using RabbitMQ.

Best regards,
Glade
