[rabbitmq-discuss] HA Queues lost when a node dies

Venkat Morampudi venkatmorampudi at gmail.com
Sat Jun 30 01:52:21 BST 2012


Francesco Mazzoli <francesco at ...> writes:

> 
> Hi Bozhidar,
> 
> It's hard to tell what happened without looking at the logs and without 
> knowing your setup; but a number of severe bugs related to HA were fixed 
> in 2.8.2, so it's definitely worth a try.
> 
> If the situation does not improve, please post more details on the list.
> 
> Francesco.
> 
> On 07/05/12 17:44, Bozhidar Bozhanov wrote:
> > Hi,
> >
> > We are currently trying to run RabbitMQ (2.8.1) in a cluster and use
> > highly-available queues. We have around 50 queues. Each queue is
> > registered with one of the nodes (at random), as master, and using
> > x-ha-policy=all. We have 2 nodes in the cluster.
> >
> > The management console shows that the cluster is successfully created,
> > and that the queues are highly-available and properly mirrored. Then
> > we kill one of the nodes (with kill -9) to simulate system failure. We
> > have tried this five times, and each time a different result was
> > observed:
> > - only 1 queue 'survived' (the metadata about the others was deleted,
> > they were not visible in the management console, and we could neither
> > publish messages to them nor consume from them)
> > - all but 3 queues survived
> > - only 10 queues survived
> > - all queues survived
> > - all but 1 queue survived
> >
> > The queues that survived properly switched their master node to the
> > only remaining one.
> >
> > The results seem to be random. Is this expected behaviour? Is it
> > likely to be fixed in 2.8.2? And how can we make sure that if a node
> > dies, the queues don't get deleted?
> 
> 
Hi Francesco,

I observed similar behavior during my HA testing on 2.8.1: RabbitMQ dropped 
queues at random when I simulated node failure by stopping the RabbitMQ 
service.

The following errors were logged in the RabbitMQ log file:

=ERROR REPORT==== 29-Jun-2012::19:15:42 ===
** Generic server <0.240.0> terminating
** Last message in was {'$gen_cast',{gm_deaths,[<6886.642.0>]}}
** When Server state == {state,
                            {amqqueue,
                                {resource,<<"/">>,queue,<<"TestQueue5">>},
                                true,false,none,
                                [{<<"x-ha-policy">>,longstr,<<"all">>}],
                                <0.223.0>,[],all},
                            <0.241.0>,
                            {dict,0,16,16,8,80,48,
                                {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                 []},
                                {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                  [],[]}}},
                            #Fun<rabbit_mirror_queue_master.1.2951048>,
                            #Fun<rabbit_mirror_queue_master.2.72654940>}
** Reason for termination ==
** {{case_clause,{ok,<0.441.0>,[]}},
    [{rabbit_mirror_queue_coordinator,handle_cast,2},
     {gen_server2,handle_msg,2},
     {proc_lib,wake_up,3}]}

=ERROR REPORT==== 29-Jun-2012::19:15:42 ===
** Generic server <0.238.0> terminating
** Last message in was {'$gen_cast',{gm_deaths,[<6886.663.0>]}}
** When Server state == {state,
                            {amqqueue,
                                {resource,<<"/">>,queue,<<"TestQueue4">>},
                                true,false,none,
                                [{<<"x-ha-policy">>,longstr,<<"all">>}],
                                <0.222.0>,[],all},
                            <0.239.0>,
                            {dict,0,16,16,8,80,48,
                                {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                 []},
                                {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                  [],[]}}},
                            #Fun<rabbit_mirror_queue_master.1.2951048>,
                            #Fun<rabbit_mirror_queue_master.2.72654940>}
** Reason for termination ==
** {{case_clause,{ok,<0.451.0>,[]}},
    [{rabbit_mirror_queue_coordinator,handle_cast,2},
     {gen_server2,handle_msg,2},
     {proc_lib,wake_up,3}]}
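
For anyone checking their own logs for the same crash, reports like the two
above can be picked out mechanically. The following is only a minimal sketch
in Python, not a RabbitMQ tool. It assumes the log uses the report layout
shown above (a "=ERROR REPORT==== <timestamp> ===" header followed by the
server state) and that the crash of interest mentions both
rabbit_mirror_queue_coordinator and case_clause. The function name
crashed_mirrored_queues is made up for illustration.

```python
import re

# Header of each report block, e.g.
# "=ERROR REPORT==== 29-Jun-2012::19:15:42 ==="
REPORT_START = re.compile(r"^=[A-Z ]+ REPORT==== (.+?) ===\s*$", re.MULTILINE)

# Queue resource tuple as printed in the server state, e.g.
# {resource,<<"/">>,queue,<<"TestQueue5">>}
QUEUE_RESOURCE = re.compile(r'\{resource,<<"([^"]*)">>,queue,<<"([^"]*)">>\}')


def crashed_mirrored_queues(log_text):
    """Return (timestamp, vhost, queue) for each report that looks like
    the mirror queue coordinator crash shown in this message."""
    results = []
    starts = list(REPORT_START.finditer(log_text))
    for i, header in enumerate(starts):
        # A report body runs until the next report header or end of file.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(log_text)
        body = log_text[header.end():end]
        # Assumption: these two markers are enough to identify the crash.
        if "rabbit_mirror_queue_coordinator" not in body:
            continue
        if "case_clause" not in body:
            continue
        queue = QUEUE_RESOURCE.search(body)
        if queue:
            results.append((header.group(1), queue.group(1), queue.group(2)))
    return results


if __name__ == "__main__":
    import sys

    # Usage: python script.py /path/to/rabbit.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for when, vhost, name in crashed_mirrored_queues(f.read()):
            print(when, vhost, name)
```

Run against a log containing the two reports above, this should list
TestQueue5 and TestQueue4 in the "/" vhost. Comparing that list with the
queues that disappeared would show whether the coordinator crash and the
lost queues line up.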

I can provide the complete log file if it helps with debugging.

Really appreciate your help.

Thanks,

-Venkat


