[rabbitmq-discuss] Cluster Memory Usage

Mon Nov 21 23:14:20 GMT 2011

On Mon, Nov 21, 2011 at 4:50 PM, Matthias Radestock
<matthias at rabbitmq.com> wrote:
> Travis,
>
> On 21/11/11 22:16, Travis wrote:
>>
>> Yes.  We've seen this problem when there are only a few dozen messages
>> and when there are hundreds of thousands.
>>
>> Basically, it seems like the slave is always taking up 2-5x the amount
>> of memory that the master uses.
>
> In my tests the slave uses a bit less memory than the master, which is what
> I would expect in a relatively lightly loaded system when there are no
> connections to the slave node.
>
>> Looking at the two servers, the slave shows that its beam.smp has
>> spent 12 more hours on cpu than the beam.smp on the master.  This
>> doesn't make sense to me if the slave is doing less work because it's
>> only handling traffic coming from the master.
>
> That's odd indeed. In my test I see the slave getting by with less than half
> the CPU utilisation of the master.
>
> Do you see this memory and CPU pattern - with the slave using more than the
> master - all the time, including shortly after a restart?

It usually takes some time to build up to this state.  Basically when
we get into it, the remote rabbits are only able to shovel data in at
a very small trickle (on the order of tens of messages a second).

We presume that the slave is bouncing in and out of the
vm_memory_high_watermark warning state, causing the master to slow
down sending messages to the slave.  The master would then tell
everything upstream to slow down because it isn't able to handle
messages fast enough.  We thought killing the slave would get the
master out of this state, since it would no longer be attempting to
pass messages to it.  But this is not the case: the throughput doesn't
go back up to "normal" levels until we restart the master.

In today's case, we were seeing about 40-80 messages a second being
delivered when in this weird state, both before and after the slave
was stopped.  When we restarted the master, we began seeing about 4000
messages a second being delivered.  So, something strange is
definitely occurring.

>
> I still haven't got a good trail to follow here. One thing to look at is the
> output of 'rabbitmqctl report' when the system is in the "excessive memory
> use" state. Please send us that.
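
Since the state takes a while to show up, one way to avoid missing it
would be to capture the report unattended.  Below is a minimal sketch
in Python, not something we run today: the function name, the output
directory and the file naming are all made up for illustration.  The
only real command involved is 'rabbitmqctl report', and it is assumed
to be on the PATH of whoever runs this.

```python
import subprocess
import time
from pathlib import Path


def capture_report(out_dir, command=("rabbitmqctl", "report")):
    """Run `command` and save its combined stdout/stderr to a timestamped file.

    `out_dir` and the file naming scheme are illustrative assumptions.
    Returns the path of the file that was written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Timestamp in the file name so repeated captures do not overwrite each other.
    stamp = time.strftime("%Y%m%dT%H%M%S")
    target = out_dir / f"report-{stamp}.txt"

    # Merge stderr into stdout so error text ends up in the same file.
    result = subprocess.run(
        list(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    target.write_text(result.stdout)
    return target
```

Calling capture_report("rabbit-reports") from cron every few minutes on
both nodes would leave a trail of reports leading up to the slowdown.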

I can do that when we next get into this state; it usually takes a day
or two after a restart of the service to reappear.  Unfortunately, we
had to restart both the master and slave about 30 minutes ago because
we had a backlog of about a million messages queued on the shoveling
RabbitMQ instances that just weren't making it to the cluster in a
timely manner.  For reference, we're only pushing about 14 million
messages through the cluster a day, which doesn't seem like a lot to
us.
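
To put rough numbers on that, here is a back-of-envelope sketch in
Python using the figures above.  It assumes the 14 million messages
arrive evenly over the day, which is a simplification (our real traffic
is burstier), and the names are only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

daily_messages = 14_000_000   # "about 14 million messages ... a day"
backlog = 1_000_000           # "a backlog of about a million messages"
degraded_rates = (40, 80)     # msgs/s delivered in the weird state
healthy_rate = 4000           # msgs/s delivered after restarting the master

# Assumption: inflow is spread uniformly across the day.
average_inflow = daily_messages / SECONDS_PER_DAY


def drain_seconds(backlog, delivery_rate, inflow_rate):
    """Seconds needed to clear `backlog`, or None if it can never drain."""
    net_rate = delivery_rate - inflow_rate
    if net_rate <= 0:
        return None  # delivery cannot keep up with inflow; backlog only grows
    return backlog / net_rate


print(f"average inflow: {average_inflow:.0f} msgs/s")
for rate in degraded_rates + (healthy_rate,):
    seconds = drain_seconds(backlog, rate, average_inflow)
    if seconds is None:
        print(f"{rate} msgs/s: backlog never drains")
    else:
        print(f"{rate} msgs/s: backlog drains in {seconds / 60:.1f} minutes")
```

Under those assumptions the average inflow is roughly 162 messages a
second, so 40-80 a second can never catch up, while 4000 a second would
clear a million-message backlog in a bit over four minutes.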

Travis
-- 
Travis Campbell
travis at ghostar.org