[rabbitmq-discuss] Recurring partitioning problem on local network

Bill Chmura bchmura at nurturhealth.com
Wed Dec 4 14:06:11 GMT 2013


Hi Simon,

That was an interesting thought that I had not considered, but I do not think that is the case.  The servers are all part of the same domain and, from what I gather, they all sync to a domain controller, and the domain controllers nominate one amongst themselves.  Also, the times on the nodes seem close enough (within a tick) when they notice each other being gone.  I'm asking one of my server guys to check the logs for time adjustments anyway, in case the VM is suddenly playing catch-up with the time.
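
To put a number on "within a tick", here is a minimal Python sketch that pulls the first "node down" timestamp out of each node's log and prints how far apart they are. The timestamps are the ones from the log excerpts quoted below; the function and variable names are my own, and it assumes each excerpt contains at least one "rabbit on node ... down" report.

```python
import re
from datetime import datetime

# Report header format seen in the log excerpts in this thread.
HEADER = re.compile(
    r"=(?:INFO|ERROR) REPORT==== (\d{2}-[A-Za-z]{3}-\d{4}::\d{2}:\d{2}:\d{2}) ==="
)

def first_node_down(log_text):
    """Return the timestamp of the first report whose body is a 'node down' line."""
    stamp = None
    for line in log_text.splitlines():
        match = HEADER.search(line)
        if match:
            # %b parses English month abbreviations under the default locale.
            stamp = datetime.strptime(match.group(1), "%d-%b-%Y::%H:%M:%S")
            continue
        body = line.strip()
        if stamp and body.startswith("rabbit on node") and body.endswith("down"):
            return stamp
    return None

def offsets_from_earliest(logs):
    """Map node name -> seconds between its first 'down' report and the earliest one."""
    stamps = {name: first_node_down(text) for name, text in logs.items()}
    stamps = {name: s for name, s in stamps.items() if s is not None}
    earliest = min(stamps.values())
    return {name: int((s - earliest).total_seconds()) for name, s in stamps.items()}

# First 'node down' report from each of the three logs quoted in this thread.
LOGS = {
    "DevApp01": "=INFO REPORT==== 27-Nov-2013::18:11:07 ===\n\n"
                "rabbit on node 'rabbit@NURWEB-DEVWEB01' down\n",
    "DevWeb01": "=INFO REPORT==== 27-Nov-2013::18:10:53 ===\n\n"
                "rabbit on node 'rabbit@NURWEB-DEVAPP01' down\n",
    "DevWeb02": "=INFO REPORT==== 27-Nov-2013::18:11:09 ===\n\n"
                "rabbit on node 'rabbit@NURWEB-DEVWEB01' down\n",
}

for name, offset in sorted(offsets_from_earliest(LOGS).items(), key=lambda kv: kv[1]):
    print(f"{name}: +{offset} s")
```

With the default net_ticktime of 60, ticks go out every 15 seconds, so that is the interval to compare these offsets against.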

For the second disk node, I hear you.  In production there are multiple of everything, including the app servers, which are what will be running the disk nodes.  We are also going to set up mirroring, but until the underlying infrastructure is solid I don't want to add complexity.

I am going to update our other platforms with the new net_ticktime setting and see if any of them fail.
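
For anyone following along, here is a quick worked calculation of what the new setting should change. It is only a sketch: it assumes the rule from the Erlang kernel documentation that an unresponsive node is detected somewhere between 0.75 and 1.25 times net_ticktime, and the function name is mine.

```python
def detection_window(net_ticktime):
    """Return (min, max) seconds of silence before a peer is declared down."""
    quarter = net_ticktime / 4
    return (net_ticktime - quarter, net_ticktime + quarter)

for ticktime in (60, 120):
    low, high = detection_window(ticktime)
    print(f"net_ticktime={ticktime}: node down after {low:.0f} to {high:.0f} s of silence")
```

So going from 60 to 120 moves the window from 45-75 seconds to 90-150 seconds, which should ride out a longer stall at the cost of slower detection of a node that really is gone.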






Regards,

Bill Chmura  
-----Original Message-----
From: Simon MacMullen [mailto:simon at rabbitmq.com] 
Sent: Wednesday, December 04, 2013 6:46 AM
To: Discussions about RabbitMQ
Cc: Bill Chmura
Subject: Re: [rabbitmq-discuss] Recurring partitioning problem on local network

Hi.

Could the clocks be jumping? If the clock jumped forward by a minute or so, the Erlang VM could conclude that there has been no response from another node for a minute, which would be enough to trigger the node down condition.
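
One rough way to check for this is to compare the wall clock against a monotonic clock on each node and log any disagreement. The sketch below is Python purely for illustration and is not part of RabbitMQ; the function names and the 5-second threshold are assumptions. On a VM the monotonic clock may itself be affected by pauses, so this only catches adjustments made to the wall clock.

```python
import time

def clock_jump(prev_wall, prev_mono, now_wall, now_mono, threshold=5.0):
    """Return the wall-clock drift in seconds if it exceeds threshold, else 0.0.

    Drift is how much further the wall clock moved than the monotonic clock
    over the same interval; positive means the wall clock jumped forward.
    """
    drift = (now_wall - prev_wall) - (now_mono - prev_mono)
    return drift if abs(drift) > threshold else 0.0

def watch(interval=1.0, threshold=5.0, rounds=None):
    """Sample both clocks every `interval` seconds and report any jump."""
    prev_wall, prev_mono = time.time(), time.monotonic()
    done = 0
    while rounds is None or done < rounds:
        time.sleep(interval)
        now_wall, now_mono = time.time(), time.monotonic()
        jump = clock_jump(prev_wall, prev_mono, now_wall, now_mono, threshold)
        if jump:
            print(f"{time.ctime(now_wall)}: wall clock jumped by {jump:+.1f} s")
        prev_wall, prev_mono = now_wall, now_mono
        done += 1
```

Calling watch() on each node and lining up anything it prints with the node down messages would show whether the two coincide.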

Is there anything else in the logs for the time period just before the node down message?

As an aside, you might not want to have a single disk node in a cluster; that's a single point of failure.

Cheers, Simon

On 03/12/13 21:54, Bill Chmura wrote:
> Hi,
>
> We are experiencing a frequent partitioning problem on our network 
> with our RabbitMQ cluster.  We've not been able to iron it out and are 
> running out of time before this needs to move into production.
>
> I'll just focus on our DEV environment as it is just a scaled down 
> version of the other environments.
>
> Running on ESXi virtual machines (4GB Ram, 4 Cores dedicated)
>
> Windows2008 R2 SP1 64-Bit
>
> RabbitMQ 3.2.0
>
> Erlang R16B02
>
> There are two web servers and one app server in the above config.
> These are all connected to the same network.  Each server is running 
> its own node - clustering is done through the rabbit config file.
>
> Devweb01 - Ram
>
> Devweb02 - Ram
>
> DevApp01 - Disk
>
> The problem is that every once in a while it starts partitioning off 
> nodes, with nothing really correlating with it happening... not big 
> traffic on the network, no disruptions we can find, etc.  We have gone 
> through and made sure there were no VM settings that allowed items to 
> "go to sleep" or anything aside from a "high performance setting"
> (versus power savings).
>
> Here is what we are seeing in the logs... which to me looks like a 
> network interruption, but nothing else indicated that the machine was 
> having issues.  We have a load balancer that flags systems with 
> problems, we have a zenoss node monitoring the servers, we checked the 
> ESXi charts and logs, we looked through windows system logs... nothing 
> seems to have been amiss.
>
> *In one partitioning event we saw this in the app server (DevApp01) log:*
>
> =INFO REPORT==== 27-Nov-2013::18:11:07 ===
>
> rabbit on node 'rabbit@NURWEB-DEVWEB01' down
>
> =ERROR REPORT==== 27-Nov-2013::18:11:10 ===
>
> Mnesia('rabbit@NURWEB-DEVAPP01'): ** ERROR ** mnesia_event got 
> {inconsistent_database, running_partitioned_network, 
> 'rabbit@NURWEB-DEVWEB01'}
>
> *And in the logs of the DevWeb01 machine mentioned above we saw this. 
> It also reports losing its connections to both of the other boxes.*
>
> =INFO REPORT==== 27-Nov-2013::18:10:53 ===
>
> rabbit on node 'rabbit@NURWEB-DEVAPP01' down
>
> =ERROR REPORT==== 27-Nov-2013::18:10:53 ===
>
> Mnesia('rabbit@NURWEB-DEVWEB01'): ** ERROR ** mnesia_event got 
> {inconsistent_database, running_partitioned_network, 
> 'rabbit@NURWEB-DEVAPP01'}
>
> =ERROR REPORT==== 27-Nov-2013::18:10:59 ===
>
> Mnesia('rabbit@NURWEB-DEVWEB01'): ** ERROR ** mnesia_event got 
> {inconsistent_database, running_partitioned_network, 
> 'rabbit@NURWEB-DEVWEB02'}
>
> =INFO REPORT==== 27-Nov-2013::18:11:00 ===
>
> only running disc node went down
>
> =INFO REPORT==== 27-Nov-2013::18:11:01 ===
>
> rabbit on node 'rabbit@NURWEB-DEVWEB02' down
>
> =INFO REPORT==== 27-Nov-2013::18:11:04 ===
>
> only running disc node went down
>
> *And web02 only mentions web01 going down:*
>
> =INFO REPORT==== 27-Nov-2013::18:11:09 ===
>
> rabbit on node 'rabbit@NURWEB-DEVWEB01' down
>
> =ERROR REPORT==== 27-Nov-2013::18:11:11 ===
>
> Mnesia('rabbit@NURWEB-DEVWEB02'): ** ERROR ** mnes
>
> None of the rabbit installs are actually down during this...
>
> We've seen the same thing on our qa and production boxes - which are 
> the same configurations, just with more nodes.  Not many, though: 8 
> nodes on production.
>
> Any ideas would be really appreciated!  I've recently added a 
> net_ticktime setting to my dev servers, set to 120 (double the default, 
> I believe), to see if that helps.
>
> *Bill *
>
>
> This email and all attachments are confidential and intended solely 
> for the use of the individual or entity to which they are addressed.
> If you have received this email in error please notify the sender by 
> replying to this message. If you are not the intended recipient, 
> please delete this message and all attachments immediately.  Do not 
> copy, disclose, use or act upon the information contained. Please note 
> that any views or opinions presented in this email are solely those of 
> the author and do not necessarily represent those of the company. 
> Finally, the recipient should check this email and any attachments for 
> the presence of viruses. While every attempt is made to verify that 
> the contents are safe, the company accepts no liability for any damage 
> caused by any virus transmitted by this email.
>
>
>
>


--
Simon MacMullen
RabbitMQ, Pivotal



