[rabbitmq-discuss] MQ Cluster Replication Traffic Questions

Thu Feb 27 19:47:58 GMT 2014

Hello folks, we're trying to troubleshoot our MQ clusters, which keep
partitioning despite using a direct crossover connection to avoid problems
when a switch fails or cycles. We have 6 servers split into 3 two-node
clusters. All boxes accept traffic from producers and consumers on eth0
(connected to the switch), and eth1 is connected directly to the other box in
the clustered pair. We use a hosts file override on each box to route MQ
traffic over the crossover, and Rabbit binds to all IPs. Both NICs on each box
are 1 Gbps.

Despite the crossover, we were seeing network partition alerts with version
3.2.2. We also saw NIC reset errors (Intel NICs), so we just upgraded the
drivers to fend off that problem and tried some buffer tuning. But we're still
dropping packets on the crossover interface, so I'm worried the partitions may
continue. Here are my questions:

1) Is it a bad idea to use a crossover like this?

2) We're seeing ~2.5 Mbps in / ~10 Mbps out on the public eth0 interface, but
~45 Mbps in / ~30 Mbps out on the crossover. Is that kind of amplification
normal?

3) If it's OK to use the crossover, what TCP tuning am I missing?
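For reference, here is a quick back-of-the-envelope check of the numbers in question 2. It is only a sketch: summing in + out per interface is my own choice of metric, and the rates are whatever averages our graphs show, so short bursts could be much higher.

```python
# Rates from question 2, in Mbps (averages; bursts are not captured here).
PUBLIC_IN, PUBLIC_OUT = 2.5, 10.0          # eth0, switch-facing
CROSSOVER_IN, CROSSOVER_OUT = 45.0, 30.0   # eth1, direct crossover
LINK_CAPACITY = 1000.0                     # both NICs are 1 Gbps


def amplification(public_in, public_out, cross_in, cross_out):
    """Total crossover traffic divided by total public traffic."""
    return (cross_in + cross_out) / (public_in + public_out)


def utilization(rate_mbps, capacity_mbps=LINK_CAPACITY):
    """Fraction of one direction of the link that a given rate uses."""
    return rate_mbps / capacity_mbps


ratio = amplification(PUBLIC_IN, PUBLIC_OUT, CROSSOVER_IN, CROSSOVER_OUT)
print(f"crossover/public traffic: {ratio:.1f}x")
print(f"busiest crossover direction: {utilization(CROSSOVER_IN):.1%} of 1 Gbps")
```

So the crossover carries roughly 6x the public traffic, yet even its busiest direction averages under 5% of the link, which is why the drops look odd to me.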

Here are some more stats from our setup:

Ubuntu 12.04, kernel 3.2.0-30-generic

~5000 connections / 6000 queues / 12000 channels per cluster

~1 dropped packet every few minutes on the crossover interface. No errors,
overruns, etc.

net.ipv4.tcp_wmem = 10240 524288 16777216
net.ipv4.tcp_rmem = 10240 524288 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
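In case it helps anyone reading these: tcp_rmem and tcp_wmem are "min default max" byte counts per TCP socket. The sketch below just decodes the triples into readable units and checks their ordering. The helper names (read_sysctl, parse_triple, describe) are my own, and read_sysctl assumes Linux procfs.

```python
from pathlib import Path
from typing import Tuple


def read_sysctl(name: str) -> str:
    """Read a live sysctl via procfs (Linux only), e.g.
    net.ipv4.tcp_rmem -> /proc/sys/net/ipv4/tcp_rmem."""
    return Path("/proc/sys", *name.split(".")).read_text().strip()


def parse_triple(value: str) -> Tuple[int, int, int]:
    """Split a 'min default max' value such as net.ipv4.tcp_rmem."""
    lo, default, hi = (int(x) for x in value.split())
    if not lo <= default <= hi:
        raise ValueError(f"expected min <= default <= max, got {value!r}")
    return lo, default, hi


def describe(name: str, value: str) -> str:
    """Render the triple in KiB/MiB for easier comparison."""
    lo, default, hi = parse_triple(value)
    return (f"{name}: min {lo // 1024} KiB, default {default // 1024} KiB, "
            f"max {hi // (1024 * 1024)} MiB")


# Values as posted above; swap in read_sysctl(name) to inspect a live box.
posted = {
    "net.ipv4.tcp_wmem": "10240 524288 16777216",
    "net.ipv4.tcp_rmem": "10240 524288 16777216",
}
for name, value in posted.items():
    print(describe(name, value))
```

Note that these are the autotuning limits; net.core.rmem_max and wmem_max cap what an application can request explicitly with SO_RCVBUF/SO_SNDBUF.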

The "packets collapsed in receive queue due to low socket buffer" counter in
netstat -s also keeps incrementing.
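That netstat -s line is backed by the TCPRcvCollapsed counter in the TcpExt rows of /proc/net/netstat. Here is a minimal sketch for sampling it directly, assuming Linux procfs. The function names are hypothetical. Keep in mind the counter is system-wide (per network namespace), so it can't tell crossover traffic apart from public traffic.

```python
import time
from pathlib import Path
from typing import Dict


def parse_netstat(text: str) -> Dict[str, Dict[str, int]]:
    """Parse /proc/net/netstat-style text, which comes in line pairs:
    'Prefix: name1 name2 ...' followed by 'Prefix: value1 value2 ...'."""
    stats: Dict[str, Dict[str, int]] = {}
    lines = text.strip().splitlines()
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, nums = values.split(":", 1)
        stats[prefix] = dict(zip(names.split(), (int(n) for n in nums.split())))
    return stats


def rcv_collapsed(text: str) -> int:
    """Value shown by netstat -s as 'packets collapsed in receive queue'."""
    return parse_netstat(text)["TcpExt"]["TCPRcvCollapsed"]


def collapsed_per_minute(interval_s: float = 60.0,
                         path: str = "/proc/net/netstat") -> float:
    """Sample the counter twice and return the increase per minute."""
    before = rcv_collapsed(Path(path).read_text())
    time.sleep(interval_s)
    after = rcv_collapsed(Path(path).read_text())
    return (after - before) * 60.0 / interval_s
```

Lining this rate up against the partition alerts might show whether the collapses and the partitions actually coincide.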

Thanks!
