[rabbitmq-discuss] questions about RabbitMQ linear scalability

Wed Aug 28 10:57:08 BST 2013

Junius,

On 23 Aug 2013, at 03:14, Junius Wang wrote:
> Before every test ( after a new RabbitMQ node added/removed to the cluster):
> 1).we reset the policy to "exactly" on 2 nodes. 2). declared 10 queues using
> rabbitmqctl and force them distributed evenly on all nodes.  e.g.in a 2 node
> cluster,5 queues resides on rabbit1, another 5 resides on rabbit2. 3-3-4 for
> 3 node cluster ,3-3-2-2 for 4 node cluster.   3) declare a 'topic" exchange
> 4) binding the 10 queues to the exchange with 10 different routing keys.
> Publishers publish messages with the 10 routing keys by turns. Thus all
> queues receive the same number of messages.  
> Is the number of queues large enough?  we can declare more queues in test
> but I don't think there will be too many queues in production.  
> 

Let's re-visit the point Michael made about adding more queues:

"the way you extend capacity with RabbitMQ is by adding nodes and using more queues"

In any messaging system where queues exhibit FIFO behaviour, each queue is a concurrency bottleneck, because it can handle only 1 message at a time. If the queue handled more than 1 inbound message at a time, it would not be possible to enforce the FIFO characteristics, since without some kind of concurrency control the ordering would be non-deterministic. The question of whether or not the number of queues is "large enough" is application specific. Where you need FIFO ordering, you'll want a single queue. Where you are processing messages separately and/or do not require ordering (e.g., between messages of different "types") then you can use separate queues: this is a design choice that only you can make. The point Michael made is that the more queues you shove messages into, the better the concurrency will be, which in turn may yield performance benefits in terms of throughput.

As to the point about "adding nodes" - a single RabbitMQ node has a finite processing capacity, dependent on available operating system resources and client usage patterns. If you work out the maximum capacity "n" of a single node in your application, where the combination of available system resources and application load on the broker is at its peak, then by adding a second node, you might increase the system's overall capacity to something close to 2n. There are important caveats here though, as Michael has already pointed out: clients can connect to any node in a cluster, but if the queue with which they're communicating resides on another node, messages (both in and out of the broker) must be transmitted between nodes.

> publishers connect to the cluster via load balancer(AWS ELB) as well as
> consumers, they don't know to which node they are connect.  So we may not
> decrease the intra-cluster traffic. But  we can try some high bandwidth
> instances. Does this help?

Increased bandwidth might be helpful, but the points Michael made will still stand. There is always going to be "some" additional overhead due to inter-node communications, because network latency is the dominating factor here. Think about a classic 2-tier application: the biggest "cost" is usually communicating with the backend (e.g., a database or some such), and *that* cost is due to the network round trips more often than not.

> Another question is that we will have a queue with lots of messages
> handled, perhaps 5000/sec(size of 2K), which is about 90% of the total
> messages handled by the cluster.  It's hard for us to split it into multiple
> queues, it's really a big design change which we try to avoid. From your
> comments, even if messages are synchronized on only 2 nodes , hardly we can
> get linear scaling , right?

That's correct. The queue must handle messages in FIFO order and has an upper bound on how fast it can receive from and/or deliver messages to its clients.

> If so, is the network traffic the major factor impacting the performance? If so ,we can try some high bandwidth instances
> or using AWS instance group. Hopes that will get better performance.

Yes and no. If the clients interacting with the queue are connected to another node, then the cost of inter-node traffic will have an impact. A equally important potential cost you must be aware of though, is that of mirroring the queue across even two cluster nodes. When a queue is mirrored, the master queue process cannot take full responsibility for a message until that message has been accepted by all replicas. Even if the mirroring policy is set to "exactly 2 nodes", this still means that a message delivered to the queue on node-1 must also be transmitted to node-2. If the message is 'persistent', then both nodes must flush to disk. If confirms are in use, no confirm can be issued until both nodes have received the message, flushed it to disk and the master *knows* that this has taken place. That's a lot of work per-message.

So as Michael said, HA/Mirroring is not a feature that helps performance - it is a feature that increases resiliency in case of node failures. Adding bandwidth is only part of the story. Even with more bandwidth, there is still going to be higher latency, not only because of the amount of inter-node traffic, but also because each node has to complete some work in order to keep the replicas in sync.

I would suggest that you look very hard at that "queue with lots of messages handled" and see if you can split it up. Imagine if you were writing to a database and you had this one giant table that almost every transaction had to write to. This would become a bottleneck quickly and you'd look to split it up if possible, or if not then at least to partition it based on suitable indexes. We don't have a feature for "partitioning a queue over multiple nodes" yet, so you'll have to think about partitioning the messages yourself. If you can do that, and therefore increase the number of queues that are able to concurrently handle messages, you should get a decent performance increase.

HTH.

Tim

> 
> -----Original Message-----
> From: rabbitmq-discuss-bounces at lists.rabbitmq.com
> [mailto:rabbitmq-discuss-bounces at lists.rabbitmq.com] On Behalf Of Michael
> Klishin
> Sent: Thursday, August 22, 2013 4:56 PM
> To: Discussions about RabbitMQ
> Subject: Re: [rabbitmq-discuss] questions about RabbitMQ linear scalability
> 
> Junius Wang:
> 
>> 1.       The throughput of two node cluster is 50%-60% worse than a single
> node broker.
> 
> With mirroring, messages have to be copied to multiple (N or, depending on
> configuration, even all) nodes in the cluster. That obviously takes more
> time than not copying anything.
> 
>> 2.       Adding more node did have improvement on throughput but we only
> got 25% improvement(throughput of 3 node cluster is 25% better than 2 node
> cluster. 4 node cluster is 25% better than 3 node cluster too). What we
> expected is a 45-degree line, that means when 2 nodes are used the
> throughput is double. With 3 nodes, then triple.
> 
> You are not providing any details about your workload but the way you extend
> capacity with RabbitMQ is by adding nodes and using more queues. Queue
> mirroring is an HA feature, which means copying more data across the
> cluster.
> 
> Maximum throughput and highest availability are largely at odds with each
> other, so you need to
> 
> 1. Use multiple queues (and the number should grow with the number of nodes)
> 2. Choose what queues to mirror. Likely not all queues are equally important
> to your system, so you can make some of them non-HA.
> 3. Configure mirroring to, say, 2 nodes instead of "all".
> 4. If you know what node is the master for a particular queue (e.g. was
> declared on that node),
>     make your clients connect there. It will decrease intra-cluster
> traffic.
> --
> MK
> 
> 
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss