[rabbitmq-discuss] Clarification on semantics of publisher confirms

Simon MacMullen simon at rabbitmq.com
Tue Jan 17 11:31:12 GMT 2012


On 16/01/12 22:12, Simone Busoli wrote:
> Hi, while using the Federation plugin with publisher confirms during
> some load tests I noticed a behavior I wasn't expecting after the link
> between two brokers with a federated exchange went down:

Hi Simone.

>     * The link stayed down for several hours, and around 100k messages
>       accumulated on the upstream broker

I hope you mean "The network was down for several hours" not "RabbitMQ 
took several hours to re-establish the link".

>     * The upstream kept publishing messages at a rate of 10/s during the
>       network failure
>     * The downstream had 5 channels, each with a consumer consuming
>       from a queue
>     * Every queue/exchange is durable, messages are persistent, and
>       autoack is off on the clients
>     * Using the default unbounded prefetch thresholds both for the
>       federation and the client channels
>
> Once the connection was restored I noticed several things:
>
>     * the upstream started delivering messages to the downstream,
>       apparently overwhelming it, since its CPU stayed at 100% for
>       several minutes
>     * none of the clients connected to the downstream received anything
>       for quite some time; I am not sure exactly when they started
>       receiving messages
>     * the UI kept showing lots of undelivered and unconfirmed messages
>       on the federation outbound queue

It is currently possible to overload a RabbitMQ server by sending it a 
huge number of small messages very fast. The symptoms are as you 
describe, since the messages back up before they reach a queue in the 
downstream broker. Eventually, if you keep sending messages, the memory 
alarm will go off and the broker will get a chance to sort itself out, 
but this can lead to quite a delay.
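
As an aside, publisher confirms can themselves act as backpressure on 
the sending side. Here is a minimal sketch using the Python pika client 
(pika, the host name and the queue name are my own illustration, not 
anything from your setup); with pika's BlockingChannel, enabling confirm 
mode makes each basic_publish block until the broker has confirmed the 
message, so a fast producer cannot run arbitrarily far ahead of the 
broker:

import pika

# Hypothetical host and queue names, for illustration only.
conn = pika.BlockingConnection(pika.ConnectionParameters('upstream-host'))
ch = conn.channel()
ch.queue_declare(queue='events', durable=True)

# Put the channel into confirm mode. From now on each basic_publish
# blocks until the broker confirms (or nacks) the message.
ch.confirm_delivery()

for i in range(100000):
    try:
        ch.basic_publish(
            exchange='',
            routing_key='events',
            body=f'message {i}'.encode(),
            properties=pika.BasicProperties(delivery_mode=2),  # persistent
        )
    except pika.exceptions.NackError:
        # The broker refused the message; a real producer would retry.
        print(f'message {i} was nacked')

conn.close()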

We're looking to improve this situation in the next release.

> After some time, around two hours, the upstream broker completed
> delivering all the messages to the downstream and the downstream
> confirmed all of them. The clients are currently still catching up with
> the backlog.

However, I'm surprised that it took two hours to churn through just 100k 
messages; that is an average of only about 14 messages per second, and 
the whole backlog should be a few seconds' worth of work on average 
hardware. Was the downstream broker particularly small?

> Now any insight into what RabbitMQ was actually doing during this time
> is appreciated, but I am specifically interested in how publisher
> confirms behave in general. From the docs:
>
> Persistent messages are confirmed when all queues have either delivered
> the message and received an acknowledgement (if required), or persisted
> the message
>
> What is not clear to me is whether there is a chance for one or more
> slow consumers, as in this case, to slow down the entire federation,
> due to the downstream broker waiting for their acknowledgements for
> delivered messages, which they cannot give promptly as they are still
> trying to catch up with the backlog. So if the federation uses
> publisher confirms, and the downstream is not acking messages to the
> upstream because the clients have not all acknowledged them, then the
> upstream will also be slowed down and its outbound queue will not
> empty until the consumers on the downstream ack their messages. If
> this is the case, I would think it a bit weird for slow consumers on
> one broker to also affect what happens on another broker.
>
> When and how does the broker decide whether to confirm messages
> because they were "delivered and acked" or because they were
> "persisted"? I would prefer it did so when persisting them, rather
> than when delivering them to clients which cannot acknowledge them in
> time.

The broker will deliver the message and also schedule it to be 
persisted. It will then send back a confirm as soon as *either* it has 
received an ack from the client *or* the message has been fsynced to 
disk, whichever happens first. So slow consumers should not be able to 
slow down a federation.

So I think the problem you were seeing was due to messages building up 
before they got to the queues on the downstream broker. But I'm 
surprised only 100k messages could do this.

Cheers, Simon

-- 
Simon MacMullen
RabbitMQ, VMware

