[rabbitmq-discuss] Mirrored queue failover

Katsushi Fukui ka.fukui@ms.scsk.jp
Fri Apr 6 03:56:25 BST 2012


Hi Matthew,

 > So in your above example, you stopped rabbit1, which should promote
 > rabbit2 to master (and there should be log entries indicating that).
 > Then, even though it looks like there's no slave on rabbit3, try

OK, I already stopped rabbit1 yesterday. The current queue status is:

# ./rabbitmqctl list_queues -n rabbit@rabbit3 name pid slave_pids
Listing queues ...
que1    <rabbit@rabbit2.2.856.0>        []
...done.
# ./rabbitmqctl list_queues -n rabbit@rabbit2 name pid slave_pids
Listing queues ...
que1    <rabbit@rabbit2.2.856.0>        []
...done.
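
For context: que1 is durable and was declared with mirroring across all
nodes via the x-ha-policy queue argument. Assuming the management
plugin's rabbitmqadmin is available, an equivalent declaration would
look something like this:

# rabbitmqadmin declare queue name=que1 durable=true arguments='{"x-ha-policy":"all"}'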

 > stopping rabbit2 too, and see if the queue then still exists on rabbit3
 > - eg a rabbitmqctl -n rabbit@rabbit3 list_queues, and also again check
 > the logs of rabbit3 to see if there are messages about the promotion of
 > a slave to master.

Now I have stopped rabbit2 as well and checked the cluster and queue status.
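
(If it matters whether the whole node or just the broker application is
stopped, I can also repeat the test stopping only the application, i.e.
something like:

# ./rabbitmqctl -n rabbit@rabbit2 stop_app

but for the results below I stopped the node as before.)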

# ./rabbitmqctl cluster_status
Cluster status of node rabbit@rabbit3 ...
[{nodes,[{disc,[rabbit@rabbit3,rabbit@rabbit2,rabbit@rabbit1]}]},
 {running_nodes,[rabbit@rabbit3]}]
...done.
# ./rabbitmqctl list_queues -n rabbit@rabbit3 name pid slave_pids
Listing queues ...
...done.

Mmm..., there is no queue. The logs of rabbit3 just said:

=INFO REPORT==== 5-Apr-2012::17:06:56 ===
rabbit on node rabbit@rabbit1 down

=INFO REPORT==== 6-Apr-2012::11:26:25 ===
rabbit on node rabbit@rabbit2 down
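
So there is nothing about a slave being promoted, only the two "down"
entries. Once I restart the nodes I will also re-check how que1 was
declared - assuming list_queues accepts arguments as an info item,
something like:

# ./rabbitmqctl -n rabbit@rabbit3 list_queues name durable arguments

should show the x-ha-policy setting on the queue.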


Thanks,
Kats



(2012/04/05 20:26), Matthew Sackman wrote:
> Hi Kats,
>
> (Just popping this back on the mailing list in case others are seeing
> the same problem)
>
> On Thu, Apr 05, 2012 at 05:47:36PM +0900, Katsushi Fukui wrote:
>> But today I rebuilt a new 3-node cluster and unfortunately got into the same situation again. Logs attached.
>>
>> The results of list_queues are odd. rabbit3 now has an error, and list_queues on that node shows different results.
>> rabbit1:
>> # ./rabbitmqctl list_queues name durable pid slave_pids synchronised_slave_pids
>> Listing queues ...
>> que1    true    <rabbit@rabbit1.1.578.0>    [<rabbit@rabbit2.2.856.0>]    [<rabbit@rabbit2.2.856.0>]
>> ...done.
>>
>> rabbit2:
>> # ./rabbitmqctl list_queues name durable pid slave_pids synchronised_slave_pids
>> Listing queues ...
>> que1    true    <rabbit@rabbit1.1.578.0>    [<rabbit@rabbit2.2.856.0>]    [<rabbit@rabbit2.2.856.0>]
>> ...done.
>>
>> rabbit3:
>> # ./rabbitmqctl list_queues name durable pid slave_pids
>> Listing queues ...
>> que1    true    <rabbit@rabbit1.1.578.0>    [<rabbit@rabbit3.3.705.0>,<rabbit@rabbit2.2.856.0>]
>> ...done.
>>
>> Please check the logs in rabbit1-3-script.log.
>>
>> If I stop rabbit1 now, que1 loses all its slaves, like this:
>> Listing queues ...
>> que1    <rabbit@rabbit2.2.856.0>        []
>> ...done.
>
> I think this is actually a mis-reporting issue - the errors in the logs
> indicate that querying the slaves for their status is the problem, not
> that the slaves don't exist. That doesn't mean the slave on rabbit3
> *does* exist, but the error doesn't show that it doesn't, if you see
> what I mean.
>
> Could you repeat the test, and when you get to the same situation (i.e.
> a slave seems to have vanished), stop both of the other nodes and then
> check the logs of the "phantom" node.
>
> So in your above example, you stopped rabbit1, which should promote
> rabbit2 to master (and there should be log entries indicating that).
> Then, even though it looks like there's no slave on rabbit3, try
> stopping rabbit2 too, and see if the queue then still exists on rabbit3
> - eg a rabbitmqctl -n rabbit@rabbit3 list_queues, and also again check
> the logs of rabbit3 to see if there are messages about the promotion of
> a slave to master.
>
> Matthew
>


