[rabbitmq-discuss] BUG: Failing nodes with HA-queue stops message flow

Mon Oct 1 15:07:34 BST 2012

On Mon, 2012-10-01 at 13:59:23 +0100, Matthias Radestock wrote:
> On 01/10/12 13:48, Simon Lundström wrote:
> >It's when/after ram01 restarts that the consumer disappears from
> >"Consumers" in the queue and message passing stops.
> 
> Are you handling the consumer cancellation notification? From
> http://www.rabbitmq.com/ha.html
> <quote>
> Clients that were consuming from the mirrored-queue [and were
> connected to a node other than the one that died] will receive a
> notification that their subscription to the mirrored-queue has been
> abruptly cancelled. At this point they should re-consume from the
> queue, which will pick up the new master.
> </quote>

I have no idea how *I* should be handling consumer cancel notifications
(I can't find this in the documentation), but I'm pretty sure Ruby AMQP
handles them, see
<https://github.com/ruby-amqp/amq-client/commit/49951878a8ff4bdbcead3c13eb7737ade6e70669>
and <https://groups.google.com/forum/#!topic/ruby-amqp/KKxv4DFVJVk>.
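As far as I understand the quoted docs, "re-consume" just means subscribing
to the queue again once the broker has cancelled the consumer. A rough sketch
of that idea follows. Everything in it is made up for illustration:
FakeChannel, consume, server_cancel! and deliver are in-memory stand-ins,
NOT the Ruby AMQP API, and "ha.queue" is a placeholder queue name.

```ruby
# Hypothetical in-memory stand-in for a channel; NOT the Ruby AMQP API.
class FakeChannel
  attr_reader :subscriptions

  def initialize
    @subscriptions = 0
    @handler = nil
    @on_cancel = nil
  end

  # Register a message handler plus a callback that runs when the
  # broker cancels the consumer (a server-sent basic.cancel).
  def consume(queue, on_cancel, &handler)
    @subscriptions += 1
    @on_cancel = on_cancel
    @handler = handler
  end

  # Simulate the broker cancelling the consumer after the master dies.
  def server_cancel!(queue)
    handler, on_cancel = @handler, @on_cancel
    @handler = nil
    @on_cancel = nil
    on_cancel.call(queue, handler) if on_cancel
  end

  def consuming?
    !@handler.nil?
  end

  # Hand a message to the current handler, if there is one.
  def deliver(payload)
    @handler.call(payload) if @handler
  end
end

received = []
channel = FakeChannel.new

# On cancellation, subscribe again with the same handler.
resubscribe = nil
resubscribe = lambda do |queue, handler|
  channel.consume(queue, resubscribe, &handler)
end

channel.consume("ha.queue", resubscribe) { |msg| received << msg }
channel.deliver("before failover")
channel.server_cancel!("ha.queue")  # the master node went away
channel.deliver("after failover")   # arrives via the new subscription
```

Without the resubscribe callback the second deliver would be dropped
silently, which looks a lot like the "consumer disappears and message
passing stops" symptom.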

How can I see the client properties (to check if my consumer reports
support for CCN) in the management GUI?

Under Connections => Consumer IP, Client properties I only get:
information http://github.com/ruby-amqp/amqp
platform  ruby 1.8.7 (2012-02-08 patchlevel 358) [universal-darwin11.0]
version 0.9.7
product AMQP gem
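
As far as I understand the RabbitMQ extension, a client announces CCN support
by setting consumer_cancel_notify to true inside a "capabilities" table in
its client properties. A small sketch of that check against the properties
listed above; supports_ccn? is a hypothetical helper name, and the Hash is
just my transcription of what the GUI shows:

```ruby
# Returns true when a client-properties table advertises support for
# consumer cancel notifications (CCN).
# Assumption: `props` is a plain Hash shaped like the AMQP client-properties
# table, with an optional nested "capabilities" table.
def supports_ccn?(props)
  caps = props["capabilities"] || props[:capabilities]
  return false unless caps.is_a?(Hash)
  value = caps["consumer_cancel_notify"] || caps[:consumer_cancel_notify]
  value == true
end

# The properties as reported by the management GUI for my consumer.
reported = {
  "information" => "http://github.com/ruby-amqp/amqp",
  "platform"    => "ruby 1.8.7 (2012-02-08 patchlevel 358) [universal-darwin11.0]",
  "version"     => "0.9.7",
  "product"     => "AMQP gem"
}

puts supports_ccn?(reported)  # prints "false": there is no "capabilities" entry
```

If that reading is right, the missing "capabilities" entry would mean my
consumer never asked for cancel notifications in the first place.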

This thread,
<https://groups.google.com/forum/?fromgroups=#!topic/ruby-amqp/TA4n5Eq4IgU>,
is the only thing related to CCN I can find on Ruby AMQP.
I'll ask on their mailing list.

> >Admittedly I'm new to AMQP and RabbitMQ, but I don't get why the
> >consumer which is connected to disk01 would start failing when ram01 is
> >being restarted.
> 
> The docs explain that:
> <quote>
> The reason for sending this notification is that informing clients
> of the loss of the master is essential: otherwise the client may
> continue to issue acknowledgements for messages they were sent by
> the old, failed master [these acks will be discarded], and not
> expect that they might be about to see the same messages again, this
> time sent by the new master.
> </quote>

Aaah, so the reason this doesn't happen when disk01 is the master and is
restarted is that the consumer is connected to disk01 and must reconnect
anyway (because it just lost the connection)?

Oh, the reason I mailed here in the first place was that Emile thought it
was strange that ram01 logged this message:
=ERROR REPORT==== 1-Oct-2012::13:55:35 ===
Discarding message {'$gen_call',{<0.173.0>,#Ref<0.0.0.967>},{notify_down,<6733.320.0>}} from <0.173.0> to <0.200.0> in an old incarnation (1) of this node (2)

But maybe that happens when the consumer doesn't handle basic.cancel?

Thanks!
- Simon