[rabbitmq-discuss] Cluster Pathology

Wed Feb 11 01:54:05 GMT 2009

Hi Dmitriy,

I don't expect the messages to get moved when the node comes back
up. It would be nice, but I accept that it doesn't happen. In fact,
if the queue has been redeclared on another node during the downtime,
the messages "appear" to be lost when the downed node returns, though
I'm sure they could be recovered from the persister log.

The fact that Node A is down is the crux of the issue (maybe its
power supply blew, maybe its disks are toast). Anyone subscribed to a
Node A-sourced queue via Node B gets orphaned: they don't get notified
that the queue no longer exists, and recreation by someone else on
Node B doesn't apply to them. The point of our cluster is failover,
with the future possibility of scaling the cluster. But first and
foremost, failover. If subscribers on Node B get orphaned, they can't
help drain the queue once it's been recreated, and it could be some
time before anyone realizes they're just twiddling their thumbs.
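
To make the "twiddling their thumbs" problem concrete, here is a minimal
sketch of a client-side stopgap: an idle watchdog that fires the
consumer's existing reconnect/recreate code when no message has arrived
for too long. This is an assumption on my part about how a client could
cope, not something the broker provides. `IdleWatchdog`, `on_stale` and
`max_idle_seconds` are hypothetical names and are not part of RabbitMQ
or any AMQP client library.

```python
import time


class IdleWatchdog:
    """Flags a consumer as possibly orphaned after a period of silence.

    Hypothetical helper: not part of RabbitMQ or any AMQP client library.
    """

    def __init__(self, max_idle_seconds, on_stale, clock=time.monotonic):
        self.max_idle_seconds = max_idle_seconds
        # on_stale is supplied by the caller, e.g. reconnect, redeclare
        # the queue and binding, then resubscribe.
        self.on_stale = on_stale
        self.clock = clock
        self.last_seen = clock()

    def message_received(self):
        # Call this from the consumer's delivery callback.
        self.last_seen = self.clock()

    def check(self):
        # Call this periodically. Returns True if on_stale was fired.
        if self.clock() - self.last_seen >= self.max_idle_seconds:
            self.on_stale()
            self.last_seen = self.clock()  # restart the idle window
            return True
        return False
```

The obvious weakness is that a quiet queue looks exactly like an
orphaned subscription, so the threshold has to be longer than the
longest expected gap between messages, and the recovery action has to
be harmless when the queue is actually fine (redeclaring an identical
queue and binding is a no-op, as in Setup A below).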

I realize message replication in a cluster is not an easy problem. But
I think it's fair to expect that either the queue itself (without its
messages) gets migrated to a surviving node, or any subscribers to
that queue on a surviving node get booted so they aren't orphaned.

-J

On Tue, Feb 10, 2009 at 6:39 PM, Dmitriy Samovskiy
<dmitriy.samovskiy at cohesiveft.com> wrote:
> Hi Jason,
>
> Jason J. W. Williams wrote:
>>
>> Setup A:
>>
>> * Consumer 1 attaches to MQ node A and creates queue and binding.
>> * Consumer 2 attaches to MQ node B and creates queue and binding (same
>> as Consumer 1 and therefore no-op'd).
>> * Producer 1 attaches to MQ node B (it also creates queue and binding
>> same as Consumer 1...no-op'd) and publishes messages. Connection to MQ
>> node B is persistent.
>> * Consumer 1 dies.
>> * MQ node A dies. Queues are not recreated on node B, and produced
>> messages are black holed. Queues are not re-created because Consumer 2
>> and Producer 1 are not notified by node B to reconnect or any other
>> error (therefore their reconnect/recreate queue code never gets
>> triggered).
>
> Have you tried restoring node A? When it went down, it might have had some
> messages in the queue. And since contents of queues are not replicated,
> nobody knows about this fact except for node A itself.
>
> When you restore it, maybe rabbit will magically detect that a queue has now
> been re-declared on another node and will migrate unconsumed messages there?
> Or not...
>
>
>> Setup B:
>>
>> * Same as Setup A except:
>> * Producer 1 attaches to MQ node A.
>> * MQ node A and Consumer 1 fail. Producer 1 reconnects to node B and
>> recreates queues and bindings. Messages Producer 1 publishes are
>> placed in the recreated queue. However, Consumer 2 never is handed
>> messages by node B (which it has been persistently attached to) after
>> the recreation of the queue.
>
> Maybe because consumer 2 is still attached to a queue that is on a node that
> is down? I suspect that when you create a binding by name, rabbit resolves
> the name string to its internal locator for a specific queue, which in this
> case is on node A which is down? I would guess that if you restart the
> consumer it will attach to newly created queue.
>
> But again, the question will remain how one can get messages from an old
> queue named "foo" when a new queue "foo" now exists on another node.
>
>
> - Dmitriy
>