[rabbitmq-discuss] Recurring partitioning problem on local network

Tue Dec 10 16:02:42 GMT 2013

On 09/12/13 22:20, Bill Chmura wrote:
> Can someone confirm for me that I understand these events correctly?  I
> would really appreciate it
>
> =INFO REPORT==== 9-Dec-2013::05:43:41 ===
>
> rabbit on node 'rabbit@NURWEB-QAAPP01' down
>
> // The above indicates the net_ticktime expired without a good response?

Either that, or the Erlang VM got positive confirmation that the node
was down. For example, when you deliberately shut down a node you'll see
this message in the logs immediately, without waiting for net_ticktime.
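
For reference, here is a minimal sketch of how long a silent node can go
unnoticed. It assumes the behaviour described in the Erlang kernel
documentation: an unresponsive node is detected somewhere between 0.75
and 1.25 times net_ticktime, and net_ticktime defaults to 60 seconds.
The function name is hypothetical.

```python
def tick_detection_window(net_ticktime=60.0):
    """Return (min_seconds, max_seconds) before a silent node is declared down.

    Assumes the documented rule: detection happens between
    net_ticktime - net_ticktime/4 and net_ticktime + net_ticktime/4.
    """
    quarter = net_ticktime / 4.0
    return (net_ticktime - quarter, net_ticktime + quarter)
```

With the default setting this gives a window of 45 to 75 seconds, so a
"down" report caused by missed ticks should follow a fairly long period
of silence from the other node.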

> =ERROR REPORT==== 9-Dec-2013::05:43:41 ===
>
> Mnesia('rabbit@NURWEB-QAWEB01'): ** ERROR ** mnesia_event got
> {inconsistent_database, running_partitioned_network,
> 'rabbit@NURWEB-QAAPP01'}
>
> // This means it got back in touch with QAAPP01 and neither had gotten
> good results from the net_ticktime specified “pings”.  So it partitioned
> itself off.

It got back in touch, and each node thought the other one had been down,
i.e. that a network partition had occurred. So this is notification of
something that has already happened, not a decision being made.

>  So close to it being down though?

Agreed, that's suspicious.

> =ERROR REPORT==== 9-Dec-2013::05:43:41 ===
>
> ** Generic server <0.323.0> terminating
>
> ** Last message in was {mnesia_locker,'rabbit@NURWEB-QAAPP01',granted}
>
> ** When Server state == {state,<0.321.0>,<0.322.0>,rabbit_mgmt_sup,
>
>                              [{rabbit_mgmt_db,
>
>                                   {rabbit_mgmt_db,start_link,[]},
>
>                                   permanent,4294967295,worker,
>
>                                   [rabbit_mgmt_db]}]}
>
> ** Reason for termination ==
>
> ** {unexpected_info,{mnesia_locker,'rabbit@NURWEB-QAAPP01',granted}}
>
> // This (above) I have seen rarely, but it seems related – any ideas
> aside from the node crashed?

We've not seen many reports of this, but it seems to be associated with 
a node being down for a very short time.

> =INFO REPORT==== 9-Dec-2013::05:43:43 ===
>
> only running disc node went down
>
> // Above indicates the cluster as it is now no longer has QAAPP01 which
> is the disk node.

Yes, this is a warning that the cluster is in a dangerous state: if a
cluster consisting only of RAM nodes is shut down, you will experience
data loss.

(As an aside: you might want to have more disc nodes to prevent this!)

> =ERROR REPORT==== 9-Dec-2013::05:45:35 ===
>
> Mnesia('rabbit@NURWEB-QAWEB01'): ** ERROR ** mnesia_event got
> {inconsistent_database, running_partitioned_network,
> 'rabbit@NURWEB-QAWEB02'}
>
> // The above indicates that it just partitioned itself off from another
> node in the cluster

Same as above.

> Any help would be very much appreciated – especially confirming that I
> understand the above events will help with the troubleshooting process.

So the fact that the node comes back almost as soon as it is seen as
down looks quite suspicious to me. If you are able to run a patched
version of the server, it might be interesting to log why nodes are
considered to be down (i.e. whether it actually is net_ticktime).
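
Short of patching the server, a rough way to quantify "almost as soon"
is to measure the gap between each "down" report and the next
running_partitioned_network report for the same node. The following is
only a sketch: it assumes the log uses the report headers and message
wording shown in the excerpts above, and the function names and the
5-second threshold are hypothetical.

```python
import re
from datetime import datetime

# Report header, e.g. "=INFO REPORT==== 9-Dec-2013::05:43:41 ==="
HEADER = re.compile(r"=(?:INFO|ERROR|WARNING) REPORT==== (\S+) ===")
# "rabbit on node '<node>' down" (quotes are optional)
DOWN = re.compile(r"rabbit on node '?(.+?)'? down")
# "... running_partitioned_network, '<node>'}" (may span lines)
PARTITION = re.compile(r"running_partitioned_network,\s*'?([^'}]+?)'?\}")


def parse_reports(log_text):
    """Yield (timestamp, body) for each report found in the log text."""
    headers = list(HEADER.finditer(log_text))
    for i, header in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(log_text)
        stamp = datetime.strptime(header.group(1), "%d-%b-%Y::%H:%M:%S")
        yield stamp, log_text[header.end():end]


def quick_returns(log_text, threshold_seconds=5):
    """Return (node, gap_seconds) pairs where a partition report follows
    a down report for the same node within threshold_seconds."""
    last_down = {}
    found = []
    for stamp, body in parse_reports(log_text):
        down = DOWN.search(body)
        if down:
            last_down[down.group(1)] = stamp
        part = PARTITION.search(body)
        if part and part.group(1) in last_down:
            gap = (stamp - last_down[part.group(1)]).total_seconds()
            if gap <= threshold_seconds:
                found.append((part.group(1), gap))
    return found
```

Run over the unquoted log file, every pair it reports is a node that
was declared down and then seen again within a few seconds, which is
much shorter than a net_ticktime timeout would normally allow.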

Cheers, Simon

-- 
Simon MacMullen
RabbitMQ, Pivotal