[rabbitmq-discuss] active/active HA cluster appears to be deadlocked

Bryan Murphy bmurphy1976 at gmail.com
Wed Sep 7 02:37:45 BST 2011


I've tried to duplicate this, and the second time around the cluster
behaved as I would have expected.  Unfortunately, I stupidly overwrote
the old cluster, so I can't go back to it for any deeper inspection.

After trawling the archives a bit, it seems that the only way to remove
a dead node from the cluster is to create a new node that masquerades
as the old one.  That seems pretty messy to me.  Am I correct about
this?  Are there plans to address this in a future release?
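
If I've got the workaround right, it would go roughly like this -- an
untested sketch, where the env-conf path, cookie location, and the
re-clustering step are my assumptions:

  # bring up a throwaway host under the dead node's identity
  # (it must also share the cluster's Erlang cookie,
  #  e.g. /var/lib/rabbitmq/.erlang.cookie)
  echo 'NODENAME=rabbit@ip-10-90-195-244' >> /etc/rabbitmq/rabbitmq-env.conf
  /etc/init.d/rabbitmq-server start

  # it may also need to be clustered back in explicitly:
  #   rabbitmqctl stop_app
  #   rabbitmqctl cluster rabbit@ip-10-212-138-134 rabbit@ip-10-90-195-244
  #   rabbitmqctl start_app

  # then remove it cleanly so the rest of the cluster forgets it
  rabbitmqctl stop_app
  rabbitmqctl reset
  /etc/init.d/rabbitmq-server stop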

Thanks!
Bryan

On Tue, Sep 6, 2011 at 3:34 PM, Bryan Murphy <bmurphy1976 at gmail.com> wrote:
> Additional info,
>
> I killed the rabbitmq process on one of the servers in the cluster.  I
> was then able to run "rabbitmqctl list_queues" on the remaining server
> and verify that there are messages in the queue that have not been
> picked up.
>
> I restarted the process, and "/etc/init.d/rabbitmq-server start" never
> exited.  I had to ^C to get out; however, the server does appear to
> have restarted and rejoined the cluster.
>
> I can now run "rabbitmqctl list_queues" on both nodes in the cluster,
> but the messages are still not flowing, and the node I restarted shows
> the following error when I run "rabbitmqctl list_queues":
>
> =ERROR REPORT==== 6-Sep-2011::20:31:35 ===
> Discarding message
> {'$gen_call',{<0.256.0>,#Ref<0.0.0.646>},{info,[name,messages]}} from
> <0.256.0> to <0.1382.0> in an old incarnation (2) of this node (3)
>
>
> On Tue, Sep 6, 2011 at 3:26 PM, Bryan Murphy <bmurphy1976 at gmail.com> wrote:
>> I created a 3-server RabbitMQ v2.6.0 cluster with all queues
>> configured with "x-ha-policy": "all".  Everything seemed to be working
>> fine, and then I killed one of the servers (halt).  This is in a
>> virtualized environment (EC2), so that server is gone.  I wanted to
>> simulate scenarios where one of the servers failed catastrophically.
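>>
>> (For reference, the queue declarations look roughly like this -- a
>> minimal pika sketch, with the queue name and host made up for
>> illustration:)
>>
>>   import pika
>>
>>   # connect to one of the cluster nodes (host is a placeholder)
>>   conn = pika.BlockingConnection(
>>       pika.ConnectionParameters(host='10.212.138.134'))
>>   ch = conn.channel()
>>
>>   # declare a durable queue mirrored across every node in the cluster
>>   ch.queue_declare(queue='work',
>>                    durable=True,
>>                    arguments={'x-ha-policy': 'all'})
>>   conn.close()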
>>
>> The problem is, now my cluster seems to be stuck in some sort of
>> deadlocked state.  One of my processes *thinks* it posted a message to
>> the cluster, but I am unable to run list_queues to verify this, and
>> the consumer has yet to receive the message.
>>
>> "rabbitmqctl list_queues" blocks forever and never returns.
>>
>> "/etc/init.d/rabbitmq-server stop" blocks forever and never actually
>> shuts down the server.
>>
>> "rabbitmqctl cluster_status" returns immediately showing me the three
>> registered nodes with only two listed as running:
>>
>> Cluster status of node 'rabbit@domU-12-31-39-00-E1-D7' ...
>> [{nodes,[{disc,['rabbit@ip-10-212-138-134','rabbit@ip-10-90-195-244',
>>                'rabbit@domU-12-31-39-00-E1-D7']}]},
>>  {running_nodes,['rabbit@ip-10-212-138-134','rabbit@domU-12-31-39-00-E1-D7']}]
>> ...done.
>>
>> As you can see, all three nodes were configured as disc nodes, and
>> rabbit@ip-10-90-195-244 is no longer running.
>>
>> So, somehow the cluster seems to be deadlocked.  Since this server
>> cannot possibly be restored, how do I get out of this state?  Is there
>> a way I can forcefully tell the rest of the cluster to forget about
>> the missing server?  I can't find an example of how to do this in the
>> documentation.
>>
>> Thanks,
>> Bryan
>>
>

