[rabbitmq-discuss] Cluster Pathology

Fri Feb 13 15:34:01 GMT 2009

On Fri, Feb 13, 2009 at 5:26 AM, Ben Hood <0x6e6562 at gmail.com> wrote:
> Drew,
>
> On Thu, Feb 12, 2009 at 7:31 PM, Drew Smathers <drew.smathers at gmail.com> wrote:
>> Steps to reproduce:
>>
>> 1. run publisher and consumer against one node to ensure queue is created there:
>>
>>  $ python publisher.py hostB 5
>>  $ python consumer.py hostB # CTL-C after receiving 5 messages
>>
>> 2. run publisher/consumer against other node - hostA
>>
>>  $ python publisher.py hostA 20
>>  $ python consumer.py hostA
>>
>> 3. Before publisher from step 2 has finished, bring down rabbitmq on hostB
>>
>>  hostB $ rabbitmqctl stop
>>
>> 4. After publisher from step 2 has finished, restart consumer:
>>
>>  $ python consumer.py hostA
>>
>> Notice that messages published after hostB was brought down were not delivered.
>
> Yes, this is the behaviour I would expect as well. As indicated
> previously (on this thread and the other related one) this is because
> the queue to which both consumers are subscribed was initially
> declared on node B. Because there is
>
> a) no automatic failover, just recovery;
> b) no propagation of the queue removal event to each consumer (the
> spec compliance issue);
>
> the queue is taken down and the guy consuming via node A will be none
> the wiser. Any subsequent messages published to that queue will be
> treated as unroutable and hence will be discarded. To recover from
> this situation, you would need to restart node B and restart the
> consumer on node A.
>

Thanks for the information.  The consumer is not as much my concern as
the publisher (also attached to A), which would continue publishing
messages that should be delivered but get discarded.  (Btw, it's still
_very_ unclear to me how to get notification that a message cannot be
routed.)  I'm solving this issue by making the publisher attach only to
the node where the queue is defined, so that socket errors stop the
publisher; this is appropriate for our system, where there are very few
publishers but many consumers.  We're also keeping a rotating log for
critical messages as another point of recovery in the event a
persister log cannot be recovered.  I haven't finalized what to do
from the consumers' perspective, except perhaps having some activity
monitor with a timeout to trigger reestablishing a channel, queue
declarations, etc.  Any ideas on how best to handle this without too
many complications such as AMQP-level events, etc.?
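
A minimal sketch of the consumer-side activity monitor described above,
in Python to match the publisher.py/consumer.py scripts.  It is
deliberately independent of any AMQP client library: the class name
ActivityMonitor, its methods, and the reconnect callback are all
hypothetical, and the callback is assumed to reopen the channel,
redeclare the queue, and resubscribe.

```python
import time


class ActivityMonitor:
    """Invoke `reconnect` when no message has been seen for `timeout` seconds.

    Hypothetical helper: `reconnect` is supplied by the caller and is assumed
    to re-establish the channel, queue declaration, and subscription.
    """

    def __init__(self, timeout, reconnect, clock=time.monotonic):
        self.timeout = timeout
        self.reconnect = reconnect
        self.clock = clock              # injectable so the logic can be tested
        self.last_seen = clock()
        self.reconnects = 0

    def touch(self):
        """Record activity; call this from the message-delivery callback."""
        self.last_seen = self.clock()

    def check(self):
        """Reconnect if the idle period reached the timeout.

        Returns True when a reconnect was triggered, False otherwise.
        """
        if self.clock() - self.last_seen < self.timeout:
            return False
        self.reconnect()
        self.reconnects += 1
        self.last_seen = self.clock()   # restart the idle timer
        return True
```

The consumer would call touch() for every delivery and check()
periodically from its wait loop.  A quiet queue looks the same as a
dead one to this monitor, so the timeout has to exceed the longest
expected gap between messages (or the publisher has to send periodic
keepalive messages).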

> Obviously it would be nice to have better handling for this kind of
> thing, which will probably happen at some stage.
>

Yes, please :)  If there are significant performance impacts, I think
it would still be nice to have as an optional runtime configuration
for applications where "(99.99999%) guaranteed delivery" is a
requirement more than overall throughput.

-Drew