[rabbitmq-discuss] Recurring partitioning problem on local network

Mon Dec 9 22:20:11 GMT 2013

Hello,

Even after upping net_ticktime to 120 we have still had issues with the cluster partitioning.  Today, though, we found evidence that this is tied to the deployment process for our production systems, so I am back to thinking this is a network or I/O problem.

Is there anything we can watch specifically when we do the deploy to catch what is going on in the erlang / rabbit world?  As I mentioned before, we are not seeing anything overt in our monitoring or performance charts.  We plan to look at it again as we re-enact the deployment over and over.  Our focus will be pretty much on the network, NIC I/O, virtual interfaces, etc.  Previous checks of the deployment stage showed overall network usage spiking to only about 30%, so a network problem seemed unlikely, but...
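One rough way to watch the network side while re-enacting the deploy would be a timestamped TCP probe between the cluster hosts. The sketch below is only an illustration, not something from our setup: the function names `probe` and `watch` are made up, and it assumes epmd is listening on its default port 4369 on each host. It measures plain TCP connect latency, not the Erlang distribution ticks themselves, so a clean result would not rule out a stall inside the VM.

```python
import socket
import time
from typing import Optional


def probe(host: str, port: int, timeout: float = 2.0) -> Optional[float]:
    """Return the TCP connect latency in seconds, or None if the connect fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return None


def watch(hosts, port=4369, interval=1.0, rounds=5):
    """Probe every host once per round and print one timestamped line each.

    Port 4369 is the epmd default; this is an assumption about the install.
    """
    results = []
    for _ in range(rounds):
        for host in hosts:
            latency = probe(host, port)
            stamp = time.strftime("%H:%M:%S")
            status = "FAIL" if latency is None else f"{latency * 1000:.1f} ms"
            print(f"{stamp} {host}:{port} {status}")
            results.append((stamp, host, latency))
        time.sleep(interval)
    return results
```

Running something like `watch(["NURWEB-QAAPP01", "NURWEB-QAWEB02"], rounds=600)` from QAWEB01 during a deploy would give a per-second record to line up against the rabbit log timestamps.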

Can someone confirm for me that I understand these events correctly?  I would really appreciate it.

=INFO REPORT==== 9-Dec-2013::05:43:41 ===
rabbit on node 'rabbit at NURWEB-QAAPP01' down

// The above indicates the net_ticktime expired without a good response?

=ERROR REPORT==== 9-Dec-2013::05:43:41 ===
Mnesia('rabbit at NURWEB-QAWEB01'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit at NURWEB-QAAPP01'}

// This means it got back in touch with QAAPP01 and neither had gotten good results from the net_ticktime specified "pings".  So it partitioned itself off.  So close to it being down though?

=ERROR REPORT==== 9-Dec-2013::05:43:41 ===
** Generic server <0.323.0> terminating
** Last message in was {mnesia_locker,'rabbit at NURWEB-QAAPP01',granted}
** When Server state == {state,<0.321.0>,<0.322.0>,rabbit_mgmt_sup,
                            [{rabbit_mgmt_db,
                                 {rabbit_mgmt_db,start_link,[]},
                                 permanent,4294967295,worker,
                                 [rabbit_mgmt_db]}]}
** Reason for termination ==
** {unexpected_info,{mnesia_locker,'rabbit at NURWEB-QAAPP01',granted}}

// This (above) I have seen rarely, but it seems related - any ideas aside from the node crashed?

=INFO REPORT==== 9-Dec-2013::05:43:43 ===
only running disc node went down

// Above indicates the cluster as it is now no longer has QAAPP01 which is the disk node.

=ERROR REPORT==== 9-Dec-2013::05:45:35 ===
Mnesia('rabbit at NURWEB-QAWEB01'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit at NURWEB-QAWEB02'}

// The above indicates that it just partitioned itself off from another node in the cluster
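To make comparing these reports across nodes less error-prone, here is a minimal sketch of pulling them into a list of events. It assumes the log layout exactly as pasted above (a `=INFO REPORT====` or `=ERROR REPORT====` header followed by the message) and the node names as they appear in this post. The names `parse_reports`, `node_down`, `partition` and `disc_node_down` are my own labels, not anything RabbitMQ defines, and reports it does not recognise are kept with kind `other`.

```python
import re
from datetime import datetime

HEADER = re.compile(r"^=(INFO|ERROR) REPORT==== (\S+) ===$")
NODE_DOWN = re.compile(r"^rabbit on node '([^']+)' down$")
PARTITION = re.compile(r"running_partitioned_network,\s*'([^']+)'")
DISC_DOWN = "only running disc node went down"


def parse_reports(text):
    """Turn pasted log text into a list of event dicts, in log order."""
    events = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            when = datetime.strptime(header.group(2), "%d-%b-%Y::%H:%M:%S")
            current = {"level": header.group(1), "time": when,
                       "kind": "other", "peer": None}
            events.append(current)
            continue
        # Skip text before the first header and reports already classified.
        if current is None or current["kind"] != "other":
            continue
        down = NODE_DOWN.match(line)
        part = PARTITION.search(line)
        if down:
            current["kind"], current["peer"] = "node_down", down.group(1)
        elif part:
            current["kind"], current["peer"] = "partition", part.group(1)
        elif line == DISC_DOWN:
            current["kind"] = "disc_node_down"
    return events
```

Feeding each node's log through this and sorting the combined events by `time` should show which node noticed the problem first, assuming the clocks on the machines agree.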

Any help would be very much appreciated - especially confirming that I understand the above events will help with the troubleshooting process.

Regards,

Bill Chmura    Director, IT Development Services

From: Bill Chmura
Sent: Tuesday, December 03, 2013 4:54 PM
To: 'rabbitmq-discuss at lists.rabbitmq.com'
Subject: Recurring partitioning problem on local network

Hi,

We are experiencing a frequent partitioning problem on our network with our RabbitMQ cluster.  We've not been able to iron it out and are running out of time before this needs to move into production.

I'll just focus on our DEV environment as it is just a scaled down version of the other environments.

Running on ESXi virtual machines (4GB Ram, 4 Cores dedicated)
Windows2008 R2 SP1 64-Bit
RabbitMQ 3.2.0
Erlang R16B02

There are two web servers and one app server in the above config.  These are all connected to the same network.  Each server is running its own node - clustering is done through the rabbit config file.

Devweb01 - Ram
Devweb02 - Ram
DevApp01 - Disk

The problem is that every once in a while it starts partitioning off nodes, with nothing really correlating with it happening... not big traffic on the network, no disruptions we can find, etc.  We have gone through and made sure there were no VM settings that allowed items to "go to sleep" or anything aside from a "high performance setting"  (versus power savings).

Here is what we are seeing in the logs... which to me looks like a network interruption, but nothing else indicated that the machine was having issues.  We have a load balancer that flags systems with problems, we have a zenoss node monitoring the servers, we checked the ESXi charts and logs, we looked through windows system logs... nothing seems to have been amiss.

In one partitioning event we saw this in the app server (DevApp01) log:

=INFO REPORT==== 27-Nov-2013::18:11:07 ===
rabbit on node 'rabbit at NURWEB-DEVWEB01' down

=ERROR REPORT==== 27-Nov-2013::18:11:10 ===
Mnesia('rabbit at NURWEB-DEVAPP01'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit at NURWEB-DEVWEB01'}

And in the logs of the DevWeb01 machine mentioned above we saw this - it also shows it lost connections to both of the other boxes.

=INFO REPORT==== 27-Nov-2013::18:10:53 ===
rabbit on node 'rabbit at NURWEB-DEVAPP01' down

=ERROR REPORT==== 27-Nov-2013::18:10:53 ===
Mnesia('rabbit at NURWEB-DEVWEB01'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit at NURWEB-DEVAPP01'}

=ERROR REPORT==== 27-Nov-2013::18:10:59 ===
Mnesia('rabbit at NURWEB-DEVWEB01'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit at NURWEB-DEVWEB02'}

=INFO REPORT==== 27-Nov-2013::18:11:00 ===
only running disc node went down

=INFO REPORT==== 27-Nov-2013::18:11:01 ===
rabbit on node 'rabbit at NURWEB-DEVWEB02' down

=INFO REPORT==== 27-Nov-2013::18:11:04 ===
only running disc node went down

And DevWeb02 only mentions DevWeb01 going down:

=INFO REPORT==== 27-Nov-2013::18:11:09 ===
rabbit on node 'rabbit at NURWEB-DEVWEB01' down

=ERROR REPORT==== 27-Nov-2013::18:11:11 ===
Mnesia('rabbit at NURWEB-DEVWEB02'): ** ERROR ** mnes

None of the rabbit installs are actually down during this...
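Putting the three logs above on one clock makes the order easier to see. The sketch below simply hard-codes the timestamps quoted in this message and sorts them. It assumes the clocks on the three VMs agree, which I have not verified, and the short descriptions are my own wording. The DevWeb02 Mnesia line is cut off in the log excerpt, so it is listed as truncated rather than guessed at.

```python
from datetime import datetime

FMT = "%d-%b-%Y::%H:%M:%S"

# (reporting node, timestamp from its log, what it reported)
OBSERVED = [
    ("DEVAPP01", "27-Nov-2013::18:11:07", "DEVWEB01 down"),
    ("DEVAPP01", "27-Nov-2013::18:11:10", "partitioned from DEVWEB01"),
    ("DEVWEB01", "27-Nov-2013::18:10:53", "DEVAPP01 down"),
    ("DEVWEB01", "27-Nov-2013::18:10:53", "partitioned from DEVAPP01"),
    ("DEVWEB01", "27-Nov-2013::18:10:59", "partitioned from DEVWEB02"),
    ("DEVWEB01", "27-Nov-2013::18:11:00", "only running disc node went down"),
    ("DEVWEB01", "27-Nov-2013::18:11:01", "DEVWEB02 down"),
    ("DEVWEB01", "27-Nov-2013::18:11:04", "only running disc node went down"),
    ("DEVWEB02", "27-Nov-2013::18:11:09", "DEVWEB01 down"),
    ("DEVWEB02", "27-Nov-2013::18:11:11", "Mnesia error (truncated in the log excerpt)"),
]


def merged_timeline(observed):
    """Sort all reports by time and express each as seconds after the first."""
    rows = [(datetime.strptime(ts, FMT), node, what) for node, ts, what in observed]
    rows.sort(key=lambda row: row[0])  # stable sort keeps same-second order
    start = rows[0][0]
    return [((when - start).total_seconds(), node, what) for when, node, what in rows]


if __name__ == "__main__":
    for offset, node, what in merged_timeline(OBSERVED):
        print(f"+{offset:4.0f}s  {node}: {what}")
```

On these numbers DevWeb01 reports trouble first, DevApp01 follows 14 seconds later, and everything happens within 18 seconds.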

We've seen the same thing on our qa and production boxes - which are the same configurations, just with more nodes.  Not many though: 8 nodes on production.

Any ideas would be really appreciated!  I've recently set net_ticktime to 120 on my dev servers (double the default of 60, I believe) to see if that helps.
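For what that setting should mean in practice: as I read the Erlang kernel documentation, connected nodes are ticked every net_ticktime / 4 seconds, and an unresponsive node is declared down somewhere between net_ticktime minus that quarter and net_ticktime plus that quarter. A small sketch of the arithmetic (the function name `tick_window` is just for illustration):

```python
def tick_window(net_ticktime: float):
    """Return (earliest, latest, tick_interval) in seconds for a net_ticktime.

    Based on the documented rule that ticks go out every net_ticktime / 4
    seconds and detection takes net_ticktime plus or minus that quarter.
    """
    tick_interval = net_ticktime / 4
    earliest = net_ticktime - tick_interval
    latest = net_ticktime + tick_interval
    return earliest, latest, tick_interval


if __name__ == "__main__":
    for value in (60, 120):
        earliest, latest, interval = tick_window(value)
        print(f"net_ticktime={value}: tick every {interval:.0f}s, "
              f"node declared down after {earliest:.0f}-{latest:.0f}s of silence")
```

So with 120 a peer would have to be silent for roughly 90 to 150 seconds before being declared down, versus 45 to 75 seconds with the default.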

Bill
