[rabbitmq-discuss] Cluster hung on node death

Mon Jun 30 12:11:18 BST 2014

We did not change net_ticktime, and the other nodes in the cluster had been
frozen for about an hour by the time networking was active again on the
node that crashed.
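
For context, a rough sketch of the failure-detection window that question is
getting at. It assumes the Erlang default net_ticktime of 60 seconds and the
documented rule that an unresponsive node is detected after roughly
net_ticktime plus or minus 25%. The function name is made up for
illustration, and the one-hour figure is just the approximate freeze we saw:

```python
def detection_window(net_ticktime=60.0):
    """Return (min, max) seconds before a silent node is considered down.

    Assumes the documented Erlang behaviour: detection happens somewhere
    between net_ticktime - 25% and net_ticktime + 25%.
    """
    quarter = net_ticktime / 4.0
    return (net_ticktime - quarter, net_ticktime + quarter)


low, high = detection_window()   # default net_ticktime, i.e. unchanged
observed_freeze = 60 * 60        # roughly one hour, in seconds

print("expected detection window: %.0f-%.0f s" % (low, high))
print("observed freeze exceeds window:", observed_freeze > high)
```

With an unchanged net_ticktime the hang would be expected to clear in a
minute or so, which is why an hour-long freeze does not look like the normal
tick timeout alone.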

Frustratingly, there was nothing in the logs, but I think that's because of
the logging bug fixed in 3.3.3, and we went live with 3.3.2 :(  We started
upgrading on Friday to fix that...
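
Since the broker logs were empty, one way to at least get a timestamped
record of when a node stops answering is an external probe. This is only a
sketch: the probe function, the 30-second timeout and the status strings are
assumptions for illustration, and the only RabbitMQ command it relies on is
"rabbitmqctl cluster_status" as mentioned in the original report.

```python
import subprocess
import time


def probe(cmd=("rabbitmqctl", "cluster_status"), timeout_s=30):
    """Run cmd and return (status, elapsed_seconds).

    status is "ok", "failed" (non-zero exit), "hung" (no answer within
    timeout_s) or "missing" (command not found on this machine).
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if this expires
        )
        status = "ok" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        status = "hung"
    except FileNotFoundError:
        status = "missing"
    return status, time.monotonic() - start


if __name__ == "__main__":
    status, elapsed = probe()
    # One line per run; append to a file from cron to build a timeline.
    print(time.strftime("%Y-%m-%d %H:%M:%S"), status, "%.1fs" % elapsed)
```

Run from cron on each node, the "hung" lines would show when the freeze
started and ended even if the broker itself logged nothing.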

Dan.
On Jun 30, 2014 6:23 AM, "Simon MacMullen" <simon at rabbitmq.com> wrote:

> On 27/06/14 20:55, Daniel Burke wrote:
>
>> Today we had the physical machine of a dedicated node kernel panic
>> (Linux, CentOS 6)... when that happened the other two nodes in the
>> cluster seemed to choke, and not respond at all.
>>
>> "rabbitmqctl cluster_status" on either of the other nodes would hang.
>>
>> The web management UI didn't respond.  I could get a login page to come
>> up but after that it would go back to not responding.
>>
>
> The management UI and "rabbitmqctl cluster_status" can hang for a short
> while, while the live nodes attempt to contact the crashed node but haven't
> got an answer from it. Once the live nodes decide that the dead node is in
> fact dead, the UI and rabbitmqctl will become responsive again. This time
> period is defined by net_ticktime (see
> http://www.rabbitmq.com/nettick.html).
>
> * Have you changed this setting?
> * Did messages about the node being down get logged by the other nodes?
> When?
>
> Cheers, Simon
>
>> When the crashed machine came back up, without starting rabbitmq on it,
>> once networking was responding, the other two nodes seemed to free up
>> and start operating normally again.
>>
>> After the rest of the cluster was operating normally again, we brought
>> down the crashed machine to do a memtest, and we didn't experience the
>> cluster freeze again (rabbitmq was not ever started back up on the
>> failed node).
>>
>> This cluster (we went live with multiple clusters yesterday) is running
>> on 3 physical dedicated machines.  All of them are on CentOS 6.  RabbitMQ
>> v3.3.2.  All nodes are disc nodes.  All queues are durable and mirrored.
>> This cluster has one queue, plus thousands of dynamic shovels (which of
>> course include their own queues on this cluster) connecting to queues
>> on 3 other clusters with similar setups.  Each node has about 7gig of
>> disk free on the relevant partition, and 48gig of RAM with
>> vm_memory_high_watermark set to 0.9, but even at diminished capacity
>> right now, the most RAM used is 1.2gig on one node and 600meg on the
>> other (these boxes were way overbuilt with short-term growth in mind).
>>
>> Sadly, there was nothing in the logs.  We realized this might be related
>> to the logging bug fixed in 3.3.3, so we just upgraded our dev
>> environment to start the process to deal with that.
>>
>> Any thoughts on what the cause of this freeze up could have been?  And
>> how to mitigate it?  Or any troubleshooting / information gathering we
>> could do if it happens again?  It's a scary thing now to have happen on
>> a Friday afternoon.  We were counting on three-node clusters getting us
>> through if there was an outage of a node during the weekend... but now
>> we're all afraid to go home for the weekend!
>>
>> Thanks!
>> Dan.
>
> --
> Simon MacMullen
> RabbitMQ, Pivotal
>