[rabbitmq-discuss] Cluster Pathology

Wed Feb 11 22:19:49 GMT 2009

On Wed, Feb 11, 2009 at 2:11 PM, Jason J. W. Williams
<jasonjwwilliams at gmail.com> wrote:
>>
>>
>> - The reason why the consumer does not get notified about the removal
>> of a queue (for whatever reason it may have disappeared) is that
>> this behaviour is not specified in the protocol. It is possible
>> that this was an oversight, you would have to seek reference from your
>> local AMQP representative. In practical terms, this notion has been
>> addressed to an extent in the 0-10 version of the protocol, but YMMV.
>> Going forwards, this is exactly the kind of thing that needs to get
>> nailed down in the 1.0 version of the protocol;
>

I understand that this thread is already getting long, so I hope my
comments don't add too much noise to it.  I'm having very similar
issues (or misunderstandings of how RabbitMQ is supposed to work) with
a project at work, and was about to start a new post about this.  We
were doing some more thorough testing on a system running RabbitMQ (v.
1.5.1) with a basic use case:

1. producer A connects to node A, declares exchange E, declares and
binds Q to E *
2. producer B connects to node B, declares exchange E, declares and
binds Q to E *
3. consumer A connects to node A, declares exchange E, binds queue Q
to E, and listens for messages via basic_consume
    => Q exists on node A
4. consumer B connects to node B (binds queue Q to exchange E) and
listens for messages via basic_consume

(* Producers declare and bind queues to ensure messages don't get
"blackholed" - borrowing Jason's terminology which seems very apt).

Now, if node A is taken down, producer A of course can't produce
messages due to socket errors, but producer B continues producing
messages not knowing they actually are just getting dropped with no
queue to route to (since Q is on node A).  What's even more surprising
to me on this matter is that if producer B restarts and publishes
messages, the messages are still blackholed.
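
The failure above can be reproduced in a toy in-memory model. This is
only a sketch of the observed behaviour, not RabbitMQ code: the Cluster
class and its methods are hypothetical, and it assumes that a queue
lives only on the node where it was first declared and that
redeclaring it elsewhere does not move it.

```python
# Toy in-memory model of the scenario above; it does not talk to RabbitMQ.
# Assumption: a queue lives only on the node where it was first declared,
# while exchange and binding metadata are visible from every node.

class Cluster:
    def __init__(self):
        self.up = {}        # node name -> is the node running?
        self.queues = {}    # queue name -> (home node, list of messages)
        self.bindings = {}  # exchange name -> set of bound queue names

    def add_node(self, node):
        self.up[node] = True

    def declare_and_bind(self, node, exchange, queue):
        # The first declaration decides the queue's home node;
        # later declarations leave the existing queue where it is.
        self.queues.setdefault(queue, (node, []))
        self.bindings.setdefault(exchange, set()).add(queue)

    def publish(self, node, exchange, body):
        """Return the number of queues that actually received the message."""
        if not self.up[node]:
            raise ConnectionError("socket error: node %s is down" % node)
        delivered = 0
        for queue in self.bindings.get(exchange, ()):
            home, messages = self.queues[queue]
            if self.up[home]:
                messages.append(body)
                delivered += 1
        return delivered  # 0 means the message was silently dropped


cluster = Cluster()
cluster.add_node("A")
cluster.add_node("B")
cluster.declare_and_bind("A", "E", "Q")   # producer A: Q now lives on A
cluster.declare_and_bind("B", "E", "Q")   # producer B: no new queue

before = cluster.publish("B", "E", "m1")  # routed to Q on node A
cluster.up["A"] = False                   # node A is taken down
after = cluster.publish("B", "E", "m2")   # accepted by B, then dropped
cluster.declare_and_bind("B", "E", "Q")   # producer B restarts, redeclares
after_restart = cluster.publish("B", "E", "m3")
print(before, after, after_restart)
```

In this model producer B never sees an error: publish() succeeds on
node B both before and after the restart, but the delivered count
falls from 1 to 0 once node A is down.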

Another bad side effect is that once node A comes back up, while
messages are then routed correctly, consumers don't receive messages
unless they're restarted.  My guess is that the consumer tag used by
the consumers is no longer valid?

So far as I can tell there is no easy way to detect either of these scenarios.
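
One crude client-side workaround is an idle watchdog: if a consumer
has seen nothing for longer than expected, it assumes it may be
orphaned and re-subscribes. The sketch below is an assumption-laden
illustration, not something RabbitMQ or py-amqplib provides; the
ConsumerWatchdog class is hypothetical, and on its own it cannot tell
a quiet queue from a dead one unless producers also publish periodic
heartbeat messages.

```python
import time


class ConsumerWatchdog:
    """Tracks how long a consumer has gone without a delivery.

    Hypothetical helper: the caller invokes on_message() from its
    delivery callback and polls is_stale() from its main loop.
    """

    def __init__(self, max_idle_seconds, clock=time.monotonic):
        self.max_idle_seconds = max_idle_seconds
        self.clock = clock            # injectable so it can be tested
        self.last_seen = clock()

    def on_message(self):
        # Record the time of the most recent delivery.
        self.last_seen = self.clock()

    def is_stale(self):
        # True once the idle time exceeds the configured limit.
        return self.clock() - self.last_seen > self.max_idle_seconds
```

When is_stale() returns True, the consumer would close its channel,
redeclare and rebind the queue, and issue basic_consume again, which
is roughly what restarting the consumer does by hand.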

> Well that explains why it doesn't happen. :-)

>
>>
>> - Quorum decisions are difficult at the best of times, hence why we
>> would need to think long and hard about how to do transparent
>> replication;
>
> While replication would be nice, I don't really mind having to replay
> the messages later.
>
>>
>> - Replay logic is potentially equally as tricky, once you have
>> considered all of the corner cases;
>
> I can see where it would be tricky, particularly for applications that
> resubmitted messages. Personally, we'd like to see it as either a
> start-up option (--enable-auto-replay) or a separate utility that can
> be pointed at a persister log  with a particular queue name.
>
>>
>> I can tell you right now that Rabbit does not currently cater for
>> these circumstances OOTB, so if these are hard requirements for you,
>> you may want to look somewhere else.
>
> Unfortunately, there's nowhere else to go. :-) We wrote our code to
> be fairly failure tolerant, and have upgraded the producers to now
> also create exchanges/queues/bindings so nothing gets blackholed. As a
> result, we'll be able to work around it for this project by deploying
> two non-clustered Rabbit instances. It's workable just not optimal.
> We'll have to be careful to upgrade the producers' queue creation code
> any time we add new queues and consumer types. Allowing producers to
> be dumb about consumers is really our design target.
>
>>
>> If your application could subscribe to AMQP level events OTOH, there
>> may be a simple way to solve this issue for you in a protocol
>> compliant fashion - for example, we do have patches that allow clients
>> to subscribe to presence events. If the above is not a KO criterion
>> for you, we could look into this option.
>

This feels like overkill to me on the client end.  From my naive
perspective, it seems like queue bindings for non-exclusive queues
should be made redundant in a cluster, so that if the node holding the
queue goes down, other nodes can provisionally take responsibility for
routing the messages to consumers' queues and persisting if necessary.
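
To make the idea concrete, here is a toy sketch of what "provisionally
taking responsibility" could look like. It is not how RabbitMQ
behaves, and the RedundantRouter class is hypothetical. It also
ignores what happens to messages that were already sitting on the
failed node.

```python
# Toy model of the proposal above; this is not RabbitMQ behaviour.
class RedundantRouter:
    def __init__(self):
        self.home_up = True
        self.queue = []        # messages held on the queue's home node
        self.provisional = []  # messages parked on a surviving node

    def publish(self, body):
        if self.home_up:
            self.queue.append(body)
        else:
            # The binding is known cluster-wide, so the surviving node
            # holds the message instead of dropping it.
            self.provisional.append(body)

    def home_failed(self):
        self.home_up = False

    def home_recovered(self):
        # Hand the parked messages back in publish order.
        self.home_up = True
        self.queue.extend(self.provisional)
        del self.provisional[:]
```

The point of the sketch is only that the publish path never has a
"no queue to route to" branch; whether the parked messages should be
persisted, and for how long, is left open.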

> I'll need to research AMQP level events, but yes we can write it in if
> py-amqplib can support it. That would be fine.
> We don't mind recreating the queues, but consumers need to know
> they're orphaned in edge cases.
>
> Thank you so much for your help! I do really appreciate it and do not
> mean to be a pain.
>
> -J