<p dir="ltr">We did not change net_ticktime, and the other nodes in the cluster had been frozen for about an hour by the time networking was active again on the node that crashed.</p>

<p dir="ltr">Frustratingly, there was nothing in the logs, but I think that's because of the logging bug fixed in 3.3.3; we went live with 3.3.2 :( We started upgrading on Friday to fix that.</p>

<p dir="ltr">Dan.</p>

<div class="gmail_quote">On Jun 30, 2014 6:23 AM, "Simon MacMullen" &lt;<a href="mailto:simon@rabbitmq.com">simon@rabbitmq.com</a>&gt; wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

On 27/06/14 20:55, Daniel Burke wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Today we had the physical machine of a dedicated node kernel panic<br>

(linux centos 6)... when that happened the other two nodes in the<br>

cluster seemed to choke, and not respond at all.<br>

<br>

"rabbitmqctl cluster_status" on either of the other nodes would hang.<br>

<br>

The web management UI didn't respond.  I could get a login page to come<br>

up but after that it would go back to not responding.<br>

</blockquote>

<br>

The management UI and "rabbitmqctl cluster_status" can hang for a short while, while the live nodes attempt to contact the crashed node but haven't got an answer from it. Once the live nodes decide that the dead node is in fact dead, the UI and rabbitmqctl will become responsive again. This time period is defined by net_ticktime (see <a href="http://www.rabbitmq.com/nettick.html" target="_blank">http://www.rabbitmq.com/nettick.html</a>).<br>
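<br>

As a rough sketch (not part of any RabbitMQ tooling), the window in which an unresponsive node should be declared down can be estimated from net_ticktime. It assumes the Erlang kernel default of 60 seconds and the bounds given in the Erlang kernel documentation (net_ticktime plus or minus a quarter); the function name detection_window is hypothetical.<br>

```python
def detection_window(net_ticktime=60.0):
    """Estimate (min_seconds, max_seconds) before a silent node is marked down.

    Assumption: the bounds are net_ticktime - net_ticktime/4 and
    net_ticktime + net_ticktime/4, as described for the Erlang kernel
    application; 60 seconds is the default when the setting is unchanged.
    """
    quarter = net_ticktime / 4
    return (net_ticktime - quarter, net_ticktime + quarter)


if __name__ == "__main__":
    low, high = detection_window()
    # With the default setting this prints: 45-75 seconds
    print(f"{low:.0f}-{high:.0f} seconds")
```

A hang that lasts much longer than the upper bound suggests something other than the normal tick timeout is involved.<br>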


<br>

* Have you changed this setting?<br>

* Did messages about the node being down get logged by the other nodes? When?<br>

<br>

Cheers, Simon<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

When the crashed machine came back up, without starting rabbitmq on it,<br>

once networking was responding, the other two nodes seemed to free up<br>

and start operating normally again.<br>

<br>

After the rest of the cluster was operating normally again, we brought<br>

down the crashed machine to do a memtest, and we didn't experience the<br>

cluster freeze again (rabbitmq was not ever started back up on the<br>

failed node).<br>

<br>

This cluster (we went live with multiple clusters yesterday) is running on<br>

3 dedicated physical machines, all on CentOS 6 with RabbitMQ<br>

3.3.2.  All nodes are disc nodes.  All queues are durable and mirrored.<br>

This cluster has one queue, plus thousands of dynamic shovels (which of<br>

course include their own queues on this cluster) connecting to queues<br>

on 3 other clusters with similar setups.  Each node has about 7 GB of<br>

disk free on the relevant partition, and 48 GB of RAM with<br>

vm_memory_high_watermark set to 0.9, but even at diminished capacity right now,<br>

the most RAM used is 1.2 GB on one node and 600 MB on the other (these<br>

boxes were way overbuilt with short-term growth in mind).<br>

<br>

Sadly, there was nothing in the logs.  We realized this might be related<br>

to the logging bug fixed in 3.3.3, so we just upgraded our dev<br>

environment to start the process to deal with that.<br>

<br>

Any thoughts on what the cause of this freeze up could have been?  And<br>

how to mitigate it?  Or any troubleshooting / information gathering we<br>

could do if it happens again?  It's a scary thing to have happen on<br>

a Friday afternoon.  We were counting on three-node clusters getting us<br>

through if there was an outage of a node during the weekend... but now<br>

we're all afraid to go home for the weekend!<br>

<br>

Thanks!<br>

Dan.<br>

<br>


<br>

</blockquote>

<br>

<br>

-- <br>

Simon MacMullen<br>

RabbitMQ, Pivotal<br>

</blockquote></div>