[rabbitmq-discuss] Recurring partitioning problem on local network

Bill Chmura bchmura at nurturhealth.com
Tue Dec 10 16:56:06 GMT 2013


I think we are willing to do anything :)

Right now we are aiming to run without clustering for the short term - we need to be up and stable.

What do you mean by a patched version?  Are there instructions?

We are also doing that other testing to try and replicate the issue... should have been this morning, but we've had a weather delay



-----Original Message-----
From: Simon MacMullen [mailto:simon at rabbitmq.com] 
Sent: Tuesday, December 10, 2013 11:03 AM
To: Discussions about RabbitMQ
Cc: Bill Chmura
Subject: Re: [rabbitmq-discuss] Recurring partitioning problem on local network

On 09/12/13 22:20, Bill Chmura wrote:
> Can someone confirm for me that I understand these events correctly?  
> I would really appreciate it
>
> =INFO REPORT==== 9-Dec-2013::05:43:41 ===
>
> rabbit on node 'rabbit at NURWEB-QAAPP01' down
>
> // The above indicates the net_ticktime expired without a good response?

Either that or the Erlang VM got positive confirmation that the node was down - you'll see this message in the logs when deliberately shutting down the node for example, without waiting for net_ticktime.
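
For reference on the net_ticktime case: as documented for the Erlang kernel application, each connected node is ticked every net_ticktime/4 seconds, and a silent node is declared down after somewhere between 0.75 and 1.25 times net_ticktime. A minimal sketch of that arithmetic follows, assuming the default net_ticktime of 60 seconds. The function name is made up for illustration and is not part of RabbitMQ or Erlang.

```python
def tick_detection_window(net_ticktime=60.0):
    """Return (tick_interval, earliest, latest) in seconds.

    tick_interval: how often a connected node is ticked (net_ticktime / 4).
    earliest/latest: bounds on how long a silent node can go unnoticed
    before it is declared down (0.75x and 1.25x net_ticktime).
    """
    tick_interval = net_ticktime / 4
    earliest = net_ticktime - tick_interval
    latest = net_ticktime + tick_interval
    return tick_interval, earliest, latest


print(tick_detection_window())  # (15.0, 45.0, 75.0)
```

So with default settings a tick timeout implies roughly 45 to 75 seconds of silence before the "down" report; a positive confirmation (e.g. the connection being closed) produces the same report without that wait.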

> =ERROR REPORT==== 9-Dec-2013::05:43:41 ===
>
> Mnesia('rabbit at NURWEB-QAWEB01'): ** ERROR ** mnesia_event got 
> {inconsistent_database, running_partitioned_network, 
> 'rabbit at NURWEB-QAAPP01'}
>
> // This means it got back in touch with QAAPP01 and neither had gotten 
> good results from the net_ticktime specified "pings".  So it 
> partitioned itself off.

It got back in touch and both nodes thought the other one had been down, i.e. a network partition had occurred. So this is notification of something that has already happened, not a decision being made.

>  So close to it being down though?

Agreed, that's suspicious.
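
One way to check how often this happens is to measure, for each peer, the time between the "down" report and the following partition report. The sketch below is only an illustration under several assumptions: the report header format matches the excerpts quoted in this thread, month names are in English, node names appear with "@" as in the real log files (the list archive rewrites "@" as " at "), and all function names are hypothetical.

```python
import re
from datetime import datetime

# Header such as: =INFO REPORT==== 9-Dec-2013::05:43:41 ===
HEADER = re.compile(
    r"^=(\w+) REPORT==== (\d{1,2}-\w{3}-\d{4}::\d{2}:\d{2}:\d{2}) ===")
DOWN = re.compile(r"rabbit on node '?([^'\s]+)'? down")
PARTITION = re.compile(r"running_partitioned_network,\s*'?([^'}\s]+)'?")


def parse_reports(text):
    """Return a list of (timestamp, body) pairs, one per log report."""
    reports = []
    for line in text.splitlines():
        line = line.strip()
        match = HEADER.match(line)
        if match:
            ts = datetime.strptime(match.group(2), "%d-%b-%Y::%H:%M:%S")
            reports.append((ts, []))
        elif reports and line:
            # Continuation line: belongs to the most recent report.
            reports[-1][1].append(line)
    return [(ts, " ".join(body)) for ts, body in reports]


def down_to_partition_gaps(text):
    """Return (node, seconds) for each partition report that follows
    a 'down' report for the same node."""
    last_down = {}
    gaps = []
    for ts, body in parse_reports(text):
        match = DOWN.search(body)
        if match:
            last_down[match.group(1)] = ts
            continue
        match = PARTITION.search(body)
        if match and match.group(1) in last_down:
            node = match.group(1)
            gaps.append((node, (ts - last_down[node]).total_seconds()))
    return gaps


# Sample taken from the reports quoted in this thread, with "@" restored.
SAMPLE = """
=INFO REPORT==== 9-Dec-2013::05:43:41 ===
rabbit on node 'rabbit@NURWEB-QAAPP01' down

=ERROR REPORT==== 9-Dec-2013::05:43:41 ===
Mnesia('rabbit@NURWEB-QAWEB01'): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network,
'rabbit@NURWEB-QAAPP01'}
"""

print(down_to_partition_gaps(SAMPLE))
```

For the sample this reports a gap of 0 seconds for QAAPP01, which is the "so close" pattern being discussed: the log timestamps only have one-second resolution, so the node was seen as down and back again within the same second.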

> =ERROR REPORT==== 9-Dec-2013::05:43:41 ===
>
> ** Generic server <0.323.0> terminating
>
> ** Last message in was {mnesia_locker,'rabbit at NURWEB-QAAPP01',granted}
>
> ** When Server state == {state,<0.321.0>,<0.322.0>,rabbit_mgmt_sup,
>
>                              [{rabbit_mgmt_db,
>
>                                   {rabbit_mgmt_db,start_link,[]},
>
>                                   permanent,4294967295,worker,
>
>                                   [rabbit_mgmt_db]}]}
>
> ** Reason for termination ==
>
> ** {unexpected_info,{mnesia_locker,'rabbit at NURWEB-QAAPP01',granted}}
>
> // This (above) I have seen rarely, but it seems related - any ideas 
> aside from the node crashed?

We've not seen many reports of this, but it seems to be associated with a node being down for a very short time.

> =INFO REPORT==== 9-Dec-2013::05:43:43 ===
>
> only running disc node went down
>
> // Above indicates the cluster as it is now no longer has QAAPP01 
> which is the disk node.

Yes, this message warns that the cluster is in a dangerous state: if a cluster consisting only of RAM nodes is shut down you will experience data loss.

(As an aside: you might want to have more disc nodes to prevent this!)

> =ERROR REPORT==== 9-Dec-2013::05:45:35 ===
>
> Mnesia('rabbit at NURWEB-QAWEB01'): ** ERROR ** mnesia_event got 
> {inconsistent_database, running_partitioned_network, 
> 'rabbit at NURWEB-QAWEB02'}
>
> // The above indicates that it just partitioned itself off from another 
> node of the cluster

Same as above.

> Any help would be very much appreciated - especially confirming that I 
> understand the above events will help with the troubleshooting process.

So the fact that the node comes back almost as soon as it is seen as down looks quite suspicious to me. If you are able to run a patched version of the server, it might be interesting to log why nodes are considered to be down (i.e. whether net_ticktime really is the cause).

Cheers, Simon

--
Simon MacMullen
RabbitMQ, Pivotal



