[rabbitmq-discuss] Queue disappear from nodes in clusters

Fri Jul 26 00:54:58 BST 2013

Hi Simon,

I think the log files are too big (2 GB) to upload here, even after I
compress them.
However, there are some lines like this in the node2 log:
/
Mirrored-queue (queue 'MyQueue_Priority3' in vhost '/'): Master
<rabbit at queue2.2.2114.1> saw deaths of mirrors <rabbit at queue3.1.365.1> 
/

"MyQueue_Priority3" was the one that vanished.

There are 3 nodes: queue1 (disc, stats), queue2 (ram), queue3 (ram).
What I did was try to enable the federation plugins on these nodes, first on
queue3, then queue2, then queue1.

On queue3: 
rabbitmq-plugins enable rabbitmq_federation
rabbitmq-plugins enable rabbitmq_federation_management
rabbitmqctl stop
rabbitmq-server -detached

Then queue3 came up fine. I did the same thing on queue2, and queue2 came up
fine. Finally, when I did it on queue1, queue3 was promoted, queue1 took quite
a long time to start as far as I remember, and then the weird things started
to happen :D. I assumed all exchanges and queues are mirrored across the nodes
in this cluster, so even with 1 node down the data should not be lost? Or is it
because the disc node going down causes the loss of the queue definitions? I'm
pretty sure that at that time all queues and exchanges were synced.
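
For reference, one way to check the "all queues are synced" assumption
before restarting the next node is to compare each queue's mirrors with its
synchronised mirrors. This is only a minimal sketch: the helper name
unsynced_queues is hypothetical, and it assumes tab-separated output from
"rabbitmqctl -q list_queues name slave_pids synchronised_slave_pids" with
both pid lists printed in the same order.

```shell
#!/bin/sh
# Hypothetical helper: print the names of queues whose mirror list differs
# from their synchronised-mirror list.
# Input (stdin): tab-separated lines of
#   name <TAB> slave_pids <TAB> synchronised_slave_pids
# Assumption: both lists are printed in the same order, so a plain string
# comparison is enough; a different ordering would be reported as unsynced.
unsynced_queues() {
    awk -F '\t' 'NF >= 3 && $2 != $3 { print $1 }'
}

# Intended use on a live node (not executed here):
#   rabbitmqctl -q list_queues name slave_pids synchronised_slave_pids | unsynced_queues
```

If this prints nothing, every mirror reported for every queue is also
reported as synchronised, and stopping one node should be safer.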

I also used WAN IPs and put the IPs in the hosts file before clustering. Is
that the problem?

The documentation (http://www.rabbitmq.com/distributed.html) says:
/
Brokers *must be* connected via reliable LAN links. Communication is via
Erlang internode messaging, requiring a shared Erlang cookie.
/
And (http://www.rabbitmq.com/partitions.html#cp-mode)

/Which mode should I pick?

*ignore* - Your network really is reliable. All your nodes are in a rack,
connected with a switch, and that switch is also the route to the outside
world. You don't want to run any risk of any of your cluster shutting down
if any other part of it fails (or you have a two node cluster).

*pause_minority* - Your network is maybe less reliable. You have clustered
across 3 AZs in EC2, and you assume that only one AZ will fail at once. In
that scenario you want the remaining two AZs to continue working and the
nodes from the failed AZ to rejoin automatically and without fuss when the
AZ comes back.

*autoheal* - Your network may not be reliable. You are more concerned with
continuity of service than with data integrity. You may have a two node
cluster.
/

I may choose *pause_minority* then. I will switch to the private IPs
(provided by the hosting provider), which will hopefully work better. By the
way, do you recommend a cluster of 3 nodes or 2 nodes?
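
For reference, the mode is selected through the cluster_partition_handling
setting of the rabbit application in rabbitmq.config. Below is a minimal
sketch that only prints such a fragment to stdout; the helper name
partition_config is hypothetical, and merging the result into an existing
rabbitmq.config (whose location depends on the installation) is left to
the reader.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal rabbitmq.config fragment that selects
# a partition handling mode. Defaults to pause_minority if no mode is given.
partition_config() {
    mode=${1:-pause_minority}
    case "$mode" in
        ignore|pause_minority|autoheal) ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    # Erlang-term config format; the trailing dot is required.
    printf '[{rabbit, [{cluster_partition_handling, %s}]}].\n' "$mode"
}

partition_config pause_minority
```

The setting would need to be the same on every node, and a node only picks
it up after a restart.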
