[rabbitmq-discuss] Node failure for a mirrored queue

Mon Feb 11 10:00:14 GMT 2013

On 11.02.2013, at 12:49, Tim Watson <watson.timothy at gmail.com> wrote:

> Hi Vladimir,
> 
> On 11 Feb 2013, at 07:52, Бородин Владимир wrote:
> 
>> Hi all.
>> 
>> I'm testing mirrored queues in v. 3.0.1. If I stop a node with 'rabbitmqctl stop_app', the clients and the other nodes behave normally (clients reconnect to other nodes through a TCP balancer, and the other nodes continue to serve the queue). But if I cut one node off from the others with iptables, or kill it with Alt+SysRq+b, the cluster stops working for a long period of time.
> 
> What exactly stops working? The whole cluster, all queues/exchanges are inaccessible? Or just this particular mirrored queue? 

Producers can't put messages in the queue, and consumers can't take them from it. I can't even get any output from 'rabbitmqctl list_queues' or 'rabbitmqctl cluster_status'.

> 
>> Is there any kind of a timeout, after which the node is considered to be dead by others?
> 
> No, although the OS networking stack can take a while to notice that peers are gone. Erlang does have a kind of heartbeat mechanism though, which should notice in a fairly timely fashion that another node has gone away. How long does the 'long period of time' last exactly?

By a long period I mean about a minute before the death of a node is noticed. I don't think that is normal for heartbeat messages. I'm using RHEL6 with the default sysctl timeouts. I can show the relevant kernel parameters if that helps.

In the master log I can see that a bit more than a minute (~70 sec) passed before the master saw the death of a mirror:
=ERROR REPORT==== 11-Feb-2013::13:07:11 ===
** Node rabbit@loadtest04g not responding **
** Removing (timedout) connection **

=INFO REPORT==== 11-Feb-2013::13:07:11 ===
rabbit on node rabbit@loadtest04g down

=INFO REPORT==== 11-Feb-2013::13:07:19 ===
Mirrored-queue (queue 'loadtest01g.domain.com.celery.pidbox' in vhost '/'): Master <rabbit@loadtest03g.1.10550.4> saw deaths of mirrors <rabbit@loadtest04g.2.28501.0>

And in the log of the surviving slave I see:
=INFO REPORT==== 11-Feb-2013::13:07:03 ===
rabbit on node rabbit@loadtest04g down

=INFO REPORT==== 11-Feb-2013::13:07:19 ===
Mirrored-queue (queue 'celery' in vhost '/'): Slave <rabbit@loadtest05g.1.18027.0> saw deaths of mirrors <rabbit@loadtest04g.2.28499.0>

The clients are actually using the queue named 'celery'.
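
To make the timing in the two log excerpts easier to compare, here is a minimal sketch that computes the gaps between the reported events. The timestamps are copied from the logs; the event labels and the helper name `seconds_between` are only illustrative.

```python
from datetime import datetime

# Timestamp format used in the "=INFO REPORT====" headers.
# Assumes English month abbreviations (C/en locale) for %b.
FMT = "%d-%b-%Y::%H:%M:%S"

# Labels are illustrative; timestamps are taken from the log excerpts.
events = {
    "slave: node down": "11-Feb-2013::13:07:03",
    "master: node down": "11-Feb-2013::13:07:11",
    "mirror deaths reported": "11-Feb-2013::13:07:19",
}

def seconds_between(earlier, later):
    """Return the whole number of seconds from `earlier` to `later`."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return int(delta.total_seconds())

first = events["slave: node down"]
for label, stamp in events.items():
    print(f"{label}: +{seconds_between(first, stamp)} s")
```

So the slave logged the node as down 8 seconds before the master did, and both reported the mirror deaths 16 seconds after that first notice.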

So I suppose there are two problems:
1. The timeout itself, which is tuned either by Erlang kernel parameters (preferable) or by OS parameters (a bit worse, because that can affect other applications). Is there a way to make it smaller in Erlang?
2. If the live nodes did see the death of the killed node, why do commands like 'rabbitmqctl cluster_status' not work? I could understand the queue not working because of a wrong policy for that queue, but that should not affect the whole cluster, should it?
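
On point 1, a rough sketch of the arithmetic, assuming the delay is governed by the Erlang kernel parameter `net_ticktime` (60 seconds by default). The Erlang kernel documentation gives the detection time T as TickTime - TickTime/4 < T < TickTime + TickTime/4. The function name `detection_window` is hypothetical; whether this parameter is really what I'm hitting is my assumption.

```python
def detection_window(net_ticktime):
    """Return (min, max) seconds before an unresponsive node is declared down.

    Assumes the rule from the Erlang kernel documentation:
    TickTime - TickTime/4 < T < TickTime + TickTime/4.
    """
    quarter = net_ticktime / 4
    return net_ticktime - quarter, net_ticktime + quarter

# Default net_ticktime of 60 s.
print(detection_window(60))  # (45.0, 75.0)
```

The ~70 sec I observed falls inside that 45-75 sec window, which would fit the default. If that is the cause, the value could presumably be lowered in rabbitmq.config with something like `{kernel, [{net_ticktime, N}]}`, set to the same value on all nodes.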

> 
>> It does not matter which node I kill, the primary or one of the slaves. There are 3 nodes in the cluster, and the queue is mirrored by a policy like this: '/	HA	^(?!amq\\.).*	{"ha-mode":"all"}	0'.
>> If I should provide extra information to help understand the problem, please tell me. Thanks.
>> 
>> --
>> Vladimir
>> 
>> 
>> _______________________________________________
>> rabbitmq-discuss mailing list
>> rabbitmq-discuss at lists.rabbitmq.com
>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
> 

--
Vladimir
