[rabbitmq-discuss] RabbitMQ load issues

Fri Jan 6 21:16:40 GMT 2012

Hi All,

At Alexis(@monadic)'s request, I'm posting a few details of an issue
that we went back and forth on over Twitter(I'm @j14159).

Full disclosure:  I expect that what I'm doing(and the type of server
I'm doing it on) may not be(or really isn't at all)
ideal/correct/whatever.

Here's the basic gist of what I'm doing:

- Ubuntu server 10.04
- RabbitMQ version 2.6.1(installed from RabbitMQ's package repo)
- Erlang R13B03(stock from Ubuntu)
- Single RabbitMQ node running on a stock Amazon EC2 m1.large(if I
recall correctly, 2 cores and 8gigs of RAM).
- Single direct exchange
- Single topic exchange
- Single vhost("/")

There are a set of harvesters collecting data from long running TCP
streams.  These harvesters send individual items to a set of
processors via the direct exchange.

The processors categorize the items by the userId they concern and a
sub-category of the type of event(there are 7 categories) and send
them to the topic exchange with a routing key following the format:

<the source of the data, irrelevant for now>.<userId>.<item's category>

So given the item categories, there are up to 7 different specific
bindings(am I using the right term?) for a queue, NOT including
wildcard ones, e.g.

"<source>.1542.category1"
"<source>.1542.category2"
"<source>.7455.category6"
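
To make the key format and the wildcard case concrete, here is a minimal Python sketch. It is not RabbitMQ's own code, just an illustrative re-implementation of the AMQP topic rules ('*' matches exactly one word, '#' matches zero or more). The function names and the "src" placeholder for the data source are hypothetical.

```python
from typing import List


def routing_key(source: str, user_id: int, category: str) -> str:
    """Build a key in the <source>.<userId>.<category> format."""
    return f"{source}.{user_id}.{category}"


def topic_matches(pattern: str, key: str) -> bool:
    """Illustrative AMQP-style topic match, word by word."""

    def match(p: List[str], k: List[str]) -> bool:
        if not p:
            # Pattern exhausted: match only if the key is exhausted too.
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb zero or more words of the key.
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if head == "*" or head == k[0]:
            # '*' consumes exactly one word; a literal must be equal.
            return match(rest, k[1:])
        return False

    return match(pattern.split("."), key.split("."))


key = routing_key("src", 1542, "category1")
print(topic_matches("src.1542.category1", key))  # specific binding
print(topic_matches("src.1542.*", key))          # per-user wildcard
```

A specific binding and a per-user wildcard binding both match the same key here; the difference is only in how many bindings the broker has to hold.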

Yesterday I started hammering our services a bit to test the limits,
here's the basic outline of what we did:

- There is a service that consumes individual items from the topic
exchange, reporting to user-land applications.
- A given user may subscribe a la carte to any combination of 6 out of
the 7 categories of items.
- The service creates an individual queue per category and listens for
updates(built using Akka 1.2's AMQP module)

When we got to about 110,000 queues bound to the topic exchange, RabbitMQ
basically stopped responding(on the direct exchange as well).  After a
little while(10-15min), rabbitmqctl could no longer even connect to
the still-running RabbitMQ instance(on the same machine).  RabbitMQ(2
processes listed in top/htop that I could see) was using ~5 gigs
of RAM.

We're in the process of getting a proper RabbitMQ cluster set up on
boxes with more RAM(~20gigs I think), likely 3 nodes, which should
alleviate some of the problems.  I've also refactored the top
layer service in question to keep a single userId's subscriptions
silo'd in one consumer(wildcard topic) which reduces the number of
bindings/queues so we're hardly in panic mode just yet ;)
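
As a rough illustration of why the refactor cuts the queue count, here is a small Python sketch comparing the two layouts. It is not the actual service code: the function names, the "src" source name, and the sample subscriptions are made up, and it assumes the source name contains no dots and that the consumer drops categories the user did not subscribe to.

```python
from typing import Dict, List, Set


def per_category_bindings(source: str, subs: Dict[int, Set[str]]) -> List[str]:
    """Old layout: one queue (and binding key) per user/category pair."""
    return [
        f"{source}.{uid}.{cat}"
        for uid, cats in sorted(subs.items())
        for cat in sorted(cats)
    ]


def per_user_bindings(source: str, subs: Dict[int, Set[str]]) -> List[str]:
    """New layout: one queue per user, bound with a single wildcard key."""
    return [f"{source}.{uid}.*" for uid in sorted(subs)]


def wanted(subs: Dict[int, Set[str]], key: str) -> bool:
    """Consumer-side filter for the wildcard layout."""
    _, uid, cat = key.split(".")
    return cat in subs.get(int(uid), set())


# Hypothetical subscriptions, only for illustration.
subs = {1542: {"category1", "category2"}, 7455: {"category6"}}
print(len(per_category_bindings("src", subs)))  # queues in the old layout
print(len(per_user_bindings("src", subs)))      # queues in the new layout
```

The trade-off in this sketch is that the wildcard queue receives every category for a user, so the filtering moves from the broker to the consumer.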

Additionally, there will be other top layer services listening to the
topic exchange in different ways - topic exchanges provide some
flexibility that we've pretty much settled on being necessary, so
going to something like direct won't do, as far as I can tell.

Sorry for the rambling post; any and all
input/criticism/whatever is most welcome.

Regards,

Jeremy Pierre