[rabbitmq-discuss] Mirrored queue failover
Katsushi Fukui
ka.fukui at ms.scsk.jp
Wed Apr 11 10:46:56 BST 2012
Hi,
Additionally, I noticed that the result of list_queues in this situation is:
# ./rabbitmqctl list_queues name pid slave_pids
Listing queues ...
que1 <rabbit at rabbit1.3.229.0> [<rabbit at rabbit3.3.229.0>, <rabbit at rabbit2.1.229.0>]
...done.
But the result of list_queues with synchronised_slave_pids is different.
# ./rabbitmqctl list_queues name pid slave_pids synchronised_slave_pids
Listing queues ...
que1 <rabbit at rabbit1.3.229.0> [<rabbit at rabbit2.1.229.0>] [<rabbit at rabbit2.1.229.0>]
...done.
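(As an aside, this mismatch can be spotted from a script rather than by eye. A rough, untested sketch: it assumes rabbitmqctl is on PATH and supports the -q quiet flag, that the node names are the ones in this thread, and that list_queues output is tab-separated; the pid lists are compared as sets so ordering differences don't matter.)

    import subprocess

    NODES = ["rabbit@rabbit1", "rabbit@rabbit2", "rabbit@rabbit3"]

    def pid_set(field):
        # field looks like: [<rabbit@rabbit2.1.229.0>, <rabbit@rabbit3.3.229.0>]
        return {p.strip() for p in field.strip("[]").split(",") if p.strip()}

    for node in NODES:
        out = subprocess.check_output(
            ["rabbitmqctl", "-n", node, "-q", "list_queues",
             "name", "slave_pids", "synchronised_slave_pids"],
            universal_newlines=True)
        for line in out.splitlines():
            name, slaves, synced = line.split("\t")
            missing = pid_set(slaves) - pid_set(synced)
            if missing:
                print("%s: %s has unsynchronised slaves: %s"
                      % (node, name, ", ".join(sorted(missing))))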
And every time I execute list_queues with synchronised_slave_pids, the rabbit3 log shows this error:
=ERROR REPORT==== 11-Apr-2012::18:01:32 ===
Discarding message {'$gen_call',{<0.203.0>,#Ref<0.0.0.717>},info} from <0.203.0> to <0.229.0> in an old incarnation (3) of this node (2)
Anyway, que1 will disappear if rabbit1 and rabbit2 are both stopped.
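(For anyone trying to reproduce the setup: a minimal sketch of how a durable queue like que1 could be declared mirrored to all nodes, assuming the RabbitMQ 2.x x-ha-policy queue argument and the pika Python client. The host name is a placeholder, and this is not necessarily how our que1 was actually declared.)

    import pika

    # Connect to any node of the cluster (placeholder host name).
    conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbit1"))
    ch = conn.channel()

    # Durable queue, mirrored to every node in the cluster.
    # (RabbitMQ 2.x style; later releases moved to HA policies.)
    ch.queue_declare(queue="que1", durable=True,
                     arguments={"x-ha-policy": "all"})
    conn.close()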
kats
(2012/04/06 11:56), Katsushi Fukui wrote:
> Hi Matthew,
>
> > So in your above example, you stopped rabbit1, which should promote
> > rabbit2 to master (and there should be log entries indicating that).
> > Then, even though it looks like there's no slave on rabbit3, try
>
> OK, I already stopped rabbit1 yesterday. The current queue status is:
>
> # ./rabbitmqctl list_queues -n rabbit at rabbit3 name pid slave_pids
> Listing queues ...
> que1 <rabbit at rabbit2.2.856.0> []
> ...done.
> # ./rabbitmqctl list_queues -n rabbit at rabbit2 name pid slave_pids
> Listing queues ...
> que1 <rabbit at rabbit2.2.856.0> []
> ...done.
>
> > stopping rabbit2 too, and see if the queue then still exists on rabbit3
> > - e.g. a rabbitmqctl -n rabbit at rabbit3 list_queues - and also check
> > the logs of rabbit3 again to see if there are messages about the promotion of
> > a slave to master.
>
> And now I have stopped rabbit2 as well and checked the cluster and queue status.
>
> # ./rabbitmqctl cluster_status
> Cluster status of node rabbit at rabbit3 ...
> [{nodes,[{disc,[rabbit at rabbit3,rabbit at rabbit2,rabbit at rabbit1]}]},
> {running_nodes,[rabbit at rabbit3]}]
> ...done.
> # ./rabbitmqctl list_queues -n rabbit at rabbit3 name pid slave_pids
> Listing queues ...
> ...done.
>
> Mmm... there is no queue. The rabbit3 logs just say:
>
> =INFO REPORT==== 5-Apr-2012::17:06:56 ===
> rabbit on node rabbit at rabbit1 down
>
> =INFO REPORT==== 6-Apr-2012::11:26:25 ===
> rabbit on node rabbit at rabbit2 down
>
>
> Thanks,
> Kats
>
>
>
> (2012/04/05 20:26), Matthew Sackman wrote:
>> Hi Kats,
>>
>> (Just popping this back on the mailing list in case others are seeing
>> the same problem)
>>
>> On Thu, Apr 05, 2012 at 05:47:36PM +0900, Katsushi Fukui wrote:
>>> But today I rebuilt a new 3-node cluster and, unfortunately, got into the same situation again. Logs attached.
>>>
>>> The results of list_queues are odd. rabbit3 now logs an error, and list_queues on that node shows different results.
>>> rabbit1:
>>> # ./rabbitmqctl list_queues name durable pid slave_pids synchronised_slave_pids
>>> Listing queues ...
>>> que1 true <rabbit at rabbit1.1.578.0> [<rabbit at rabbit2.2.856.0>] [<rabbit at rabbit2.2.856.0>]
>>> ...done.
>>>
>>> rabbit2:
>>> # ./rabbitmqctl list_queues name durable pid slave_pids synchronised_slave_pids
>>> Listing queues ...
>>> que1 true <rabbit at rabbit1.1.578.0> [<rabbit at rabbit2.2.856.0>] [<rabbit at rabbit2.2.856.0>]
>>> ...done.
>>>
>>> rabbit3:
>>> # ./rabbitmqctl list_queues name durable pid slave_pids
>>> Listing queues ...
>>> que1 true <rabbit at rabbit1.1.578.0> [<rabbit at rabbit3.3.705.0>,<rabbit at rabbit2.2.856.0>]
>>> ...done.
>>>
>>> Please check the logs in rabbit1-3-script.log.
>>>
>>> If I stop rabbit1 now, que1 loses all its slaves, like this:
>>> Listing queues ...
>>> que1 <rabbit at rabbit2.2.856.0> []
>>> ...done.
>>
>> I think this is actually a mis-reporting issue - the errors in the logs
>> indicate that querying the slaves for their status is what fails, not
>> that the slaves don't exist. That doesn't mean the slave on rabbit3
>> *does* exist, but the error doesn't indicate that it doesn't, if you
>> see what I mean.
>>
>> Could you repeat the test, and when you get to the same situation (i.e.
>> a slave seems to have vanished), stop both of the other nodes and then
>> check the logs of the "phantom" node.
>>
>> So in your above example, you stopped rabbit1, which should promote
>> rabbit2 to master (and there should be log entries indicating that).
>> Then, even though it looks like there's no slave on rabbit3, try
>> stopping rabbit2 too, and see if the queue then still exists on rabbit3
>> - e.g. a rabbitmqctl -n rabbit at rabbit3 list_queues - and also check
>> the logs of rabbit3 again to see if there are messages about the promotion of
>> a slave to master.
>>
>> Matthew
>>
>
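P.S. Matthew's test sequence can also be scripted end-to-end. Another rough, untested sketch with the same assumptions as above, using rabbitmqctl's stop_app, which stops the broker but leaves the Erlang VM up so the node can later be restarted with start_app:

    import subprocess, time

    def ctl(node, *args):
        # Thin wrapper around rabbitmqctl for a given node.
        return subprocess.check_output(
            ["rabbitmqctl", "-n", node, "-q"] + list(args),
            universal_newlines=True)

    # Stop the master's broker, give the cluster a moment to promote a
    # slave, then see what rabbit2 reports.
    ctl("rabbit@rabbit1", "stop_app")
    time.sleep(5)
    print(ctl("rabbit@rabbit2", "list_queues", "name", "pid", "slave_pids"))

    # Stop the promoted master too and check whether que1 still exists
    # on the last remaining node.
    ctl("rabbit@rabbit2", "stop_app")
    time.sleep(5)
    print(ctl("rabbit@rabbit3", "list_queues", "name", "pid", "slave_pids"))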