<div dir="ltr"><span style="font-family:arial,sans-serif;font-size:13px">Hello everyone!</span><div style="font-family:arial,sans-serif;font-size:13px"><br></div><div style="font-family:arial,sans-serif;font-size:13px">We are trying to understand RabbitMQ behavior that we see in a highly available cluster and I'm hoping someone here can shed some light on it.</div>
<div style="font-family:arial,sans-serif;font-size:13px"><br></div><div style="font-family:arial,sans-serif;font-size:13px">We have a 3-node cluster that is exposed to the application through a virtual IP on an F5 load balancer. Most of the queues we create are highly available queues, so that no messages are lost if a node fails catastrophically.</div>
<div style="font-family:arial,sans-serif;font-size:13px"><br></div><div style="font-family:arial,sans-serif;font-size:13px">The RabbitMQ nodes run version 3.2.2 on Erlang R16B02.</div><div style="font-family:arial,sans-serif;font-size:13px">
<br></div><div style="font-family:arial,sans-serif;font-size:13px">The application consists of a series of REST HTTP interfaces which place the incoming messages onto the rabbit queues.</div><div style="font-family:arial,sans-serif;font-size:13px">
<br></div><div style="font-family:arial,sans-serif;font-size:13px">When we load test the application, we see a periodic delay in writing a message to RabbitMQ. The delay happens about once every few minutes and lasts up to 30 seconds.</div>
<div style="font-family:arial,sans-serif;font-size:13px"><br></div><div style="font-family:arial,sans-serif;font-size:13px">We have tested in several scenarios:</div><div style="font-family:arial,sans-serif;font-size:13px">
3-node load-balanced cluster</div><div style="font-family:arial,sans-serif;font-size:13px">single node with no HA policy applied</div><div style="font-family:arial,sans-serif;font-size:13px">on Windows Server 2008 R2</div>
<div style="font-family:arial,sans-serif;font-size:13px">on Linux (Oracle Linux)</div><div style="font-family:arial,sans-serif;font-size:13px">on VMware virtual machines on high-speed SAN storage</div><div style="font-family:arial,sans-serif;font-size:13px">
on physical machines with SSDs</div><div style="font-family:arial,sans-serif;font-size:13px">direct connection to a single node with no load balancer</div><div style="font-family:arial,sans-serif;font-size:13px"><br></div><div style="font-family:arial,sans-serif;font-size:13px">
We haven't tested every combination in this list, but we've tried to isolate I/O, operating system, machine characteristics, etc.</div><div style="font-family:arial,sans-serif;font-size:13px"><br></div><div style="font-family:arial,sans-serif;font-size:13px">
In every test scenario under load, however, network tracing tools show a socket opened to RabbitMQ for a write operation and then no response for 20-30 seconds.</div><div style="font-family:arial,sans-serif;font-size:13px">
<br></div><div style="font-family:arial,sans-serif;font-size:13px">Has anyone seen any behavior like this? </div><div style="font-family:arial,sans-serif;font-size:13px"><br></div><div style="font-family:arial,sans-serif;font-size:13px">
Our concern is that under load some REST interfaces will show periodic slowness; we have SLAs of ~1 second on the interfaces.</div><div style="font-family:arial,sans-serif;font-size:13px"><br></div><div style="font-family:arial,sans-serif;font-size:13px">
Thanks for any input!</div><div style="font-family:arial,sans-serif;font-size:13px"><br></div><div style="font-family:arial,sans-serif;font-size:13px">Cheers,</div><div style="font-family:arial,sans-serif;font-size:13px">
<br></div><div style="font-family:arial,sans-serif;font-size:13px">Ron Cordell</div></div>