[rabbitmq-discuss] Lower delivery rate than publish rate - why?

MikeTempleman mike at meshfire.com
Wed Dec 18 18:33:16 GMT 2013


Well, multi-ack didn't help very much. We can see some improvement, but not
enough to matter.
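
By multi-ack I mean acknowledging deliveries in batches with multiple=true.
As a point of reference, a minimal sketch of that pattern with the plain Java
client looks roughly like this (the host, batch size, and standalone class
are illustrative; our real consumers live inside the Grails app):

    import com.rabbitmq.client.*;

    import java.io.IOException;

    public class BatchAckConsumer {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost");          // placeholder host
            Connection connection = factory.newConnection();
            final Channel channel = connection.createChannel();

            channel.basicQos(100);                 // prefetch, as in our setup

            channel.basicConsume("user", false, new DefaultConsumer(channel) {
                private int unacked = 0;

                @Override
                public void handleDelivery(String consumerTag, Envelope envelope,
                                           AMQP.BasicProperties properties,
                                           byte[] body) throws IOException {
                    // ... process the message ...
                    unacked++;
                    if (unacked >= 50) {
                        // multiple=true acks this delivery tag and every older
                        // unacked delivery on this channel in a single frame
                        channel.basicAck(envelope.getDeliveryTag(), true);
                        unacked = 0;
                    }
                }
            });
        }
    }

In practice the remainder would also need to be acked on a timer or at
shutdown, otherwise the tail of a burst sits unacknowledged until more
messages arrive.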

We cannot use auto-ack because consumers (multiple per server) die unexpectedly
as the app servers are autoscaled. We have not built a fully separated
service yet (too hard to debug on development machines right now). But
could publisher confirms resolve the issue of servers dying with n messages
in their prefetch buffers?
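
For reference, this is roughly what enabling publisher confirms looks like
with the plain Java client; confirms acknowledge that the broker has taken
responsibility for a published message, so they sit on the publish side of
the pipeline. The host, exchange, and routing key below are placeholders.

    import com.rabbitmq.client.*;

    public class ConfirmedPublisher {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost");            // placeholder host
            Connection connection = factory.newConnection();
            Channel channel = connection.createChannel();

            channel.confirmSelect();                 // put channel in confirm mode

            byte[] body = "example event".getBytes("UTF-8");
            // exchange and routing key are placeholders
            channel.basicPublish("events", "user.update",
                                 MessageProperties.PERSISTENT_BASIC, body);

            // Blocks until the broker has confirmed (or nacked) everything
            // published on this channel so far; throws if it takes > 5s.
            channel.waitForConfirmsOrDie(5000);

            channel.close();
            connection.close();
        }
    }

The same channel can also register a ConfirmListener to handle confirms
asynchronously instead of blocking.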



-- 

*Mike Templeman*
*Head of Development*

T: @talkingfrog1950 <http://twitter.com/missdestructo>
T: @Meshfire <http://twitter.com/meshfire>



On Sun, Dec 15, 2013 at 5:02 AM, Alvaro Videla wrote:

> Hi Mike,
>
> Yes, RabbitMQ queues are designed for fast delivery of messages and for
> being as empty as possible, as that blog post explains.
>
> Another interesting blog post, about consumer strategies and basic.qos
> settings, is this one:
> http://www.rabbitmq.com/blog/2012/05/11/some-queuing-theory-throughput-latency-and-bandwidth/#more-276
>
> Re multi-ack: yes, that might help.
>
> Regards,
>
> Alvaro
>
>
> On Sat, Dec 14, 2013 at 2:15 AM, MikeTempleman wrote:
>
>> I realized that was a bad interpretation. Sorry. The exchange is just
>> successfully routing all the messages to the target queues.
>>
>> After reading a number of posts and this blog entry (
>> http://www.rabbitmq.com/blog/2011/09/24/sizing-your-rabbits/), I wonder
>> if the issue is that each message is ack'd individually. The issue seemed
>> to occur when I had a large backlog in the queues. When Rabbit is empty,
>> performance is fine. When the consumers tried to run at much higher speeds,
>> we encountered this cycling.
>>
>> We have run a brief test with no-ack (not on production), and the
>> performance is excellent even under load. But without a full redesign this
>> is not a viable solution (app servers crash, and autoscaling shuts down
>> servers that have prefetched messages and are still connected to Rabbit).
>>
>> Assuming each queue is handled by a single thread (which I assume covers
>> receipt, delivery, and ack cleanup), I can understand what might happen when
>> the consumers generate ~500 acks/s while new messages are coming in at a low
>> 50-100/s rate on a specific queue. I will move some events that tend to
>> generate peaks out into their own queue and accept that that queue is
>> processed more slowly. As for separating the real worker queue, I suppose I
>> could create two or so static queues to divide up the load.
>>
>> So what I think I can do is:
>> 1. Bump the default TCP buffer from 128KB to around 10MB. The added
>> buffering may help a little.
>> 2. See if I can find out how to set the multiple-ack flag. We are using
>> Grails, so maybe that is just a matter of creating a custom bean. I don't
>> know.
>> 3. Create a couple of queues for lower-priority events, specifically
>> events chosen to be less time-critical.
>> 4. If all that doesn't work, then probably create 4 queues for the
>> high-priority events, publish to those queues at random, and attach
>> consumers to each queue (a rough sketch follows this list).
>> 5. Also, upgrade the server to the latest version.
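>>
>> A rough sketch of item 4 with the plain Java client: publish each message
>> to one of four queues chosen at random via the default exchange, which
>> routes on the queue name. The host and queue names below are made up.
>>
>>     import com.rabbitmq.client.*;
>>
>>     import java.util.Random;
>>
>>     public class RandomQueuePublisher {
>>         // hypothetical names for the split-out high-priority queues
>>         private static final String[] QUEUES =
>>                 {"user-1", "user-2", "user-3", "user-4"};
>>
>>         public static void main(String[] args) throws Exception {
>>             ConnectionFactory factory = new ConnectionFactory();
>>             factory.setHost("localhost");               // placeholder host
>>             Connection connection = factory.newConnection();
>>             Channel channel = connection.createChannel();
>>             Random random = new Random();
>>
>>             for (String q : QUEUES) {
>>                 channel.queueDeclare(q, true, false, false, null); // durable
>>             }
>>
>>             byte[] body = "example event".getBytes("UTF-8");
>>             // The default ("") exchange routes on the queue name, so a
>>             // random queue name spreads the load across the four queues.
>>             String target = QUEUES[random.nextInt(QUEUES.length)];
>>             channel.basicPublish("", target,
>>                                  MessageProperties.PERSISTENT_BASIC, body);
>>
>>             channel.close();
>>             connection.close();
>>         }
>>     }
>>
>> Hashing on a key such as the user id would work just as well; random is
>> simply the easiest way to spread the load.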
>>
>> Mike Templeman
>>
>>
>> On Fri, Dec 13, 2013 at 1:42 PM, Mike Templeman wrote:
>>
>>> I noticed something else very odd.
>>>
>>> Currently, one queue has 43,000 messages backed up. But when I look at
>>> the exchange (there is only one exchange) I see that the message rate in
>>> exactly matches the message rate out.
>>>
>>> With such a huge backlog, why would that be? I would have thought that
>>> the consumers (there are 16 total distributed across 4 systems for that
>>> queue with a prefetch of 100) would run at a much higher steady state.
>>>
>>> This exchange also seems to cycle regularly, running from a low of
>>> around 60/s in and out up to 500+/s in and out.
>>>
>>>
>>> On Fri, Dec 13, 2013 at 10:40 AM, Mike Templeman wrote:
>>>
>>>> Also, the Connections screen in the web UI shows that flow control has
>>>> not recently been turned on for any of the four current connections (four
>>>> app servers).
>>>>
>>>>
>>>> On Fri, Dec 13, 2013 at 10:17 AM, Mike Templeman wrote:
>>>>
>>>>> Hi Alvaro
>>>>>
>>>>> I would be more than happy to provide logs, but all they contain is
>>>>> connection and shutdown information. Nothing more. I have just enabled
>>>>> tracing on the vhost and will send the logs shortly. We encounter this
>>>>> issue under load every day now.
>>>>>
>>>>> Let me tell you our architecture and deployment:
>>>>>
>>>>> RabbitMQ:
>>>>>
>>>>>    - m1.large EC2 instance. Version: RabbitMQ 3.1.5, Erlang R14B04
>>>>>    - 23 queues (transaction and direct)
>>>>>    - 3 exchanges in use: two fanout and one topic (topic exchange
>>>>>    overview is attached)
>>>>>    - 46 total channels
>>>>>
>>>>>
>>>>> AppServers:
>>>>>
>>>>>    - m1.large Tomcat servers running a Grails application
>>>>>    - 2-7 servers at any one time
>>>>>    - Consume + publish
>>>>>    - On busy queues, each server has 16 consumers with a prefetch of 100
>>>>>    (a rough sketch of this setup follows the list)
>>>>>    - Message sizes on busy queues are ~4KB
>>>>>    - Publishing rates on the busiest queue range from 16/s to >100/s
>>>>>    (we need to be able to support 1000/s)
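>>>>>
>>>>> Roughly how that consumer setup maps onto the plain Java client, as a
>>>>> sketch only: one channel per consumer, a per-channel prefetch of 100, and
>>>>> manual acks. The host is a placeholder, and the real consumers are wired
>>>>> up through Grails rather than a loop like this.
>>>>>
>>>>>     import com.rabbitmq.client.*;
>>>>>
>>>>>     import java.io.IOException;
>>>>>
>>>>>     public class UserQueueConsumers {
>>>>>         public static void main(String[] args) throws Exception {
>>>>>             ConnectionFactory factory = new ConnectionFactory();
>>>>>             factory.setHost("localhost");      // placeholder host
>>>>>             Connection connection = factory.newConnection();
>>>>>
>>>>>             for (int i = 0; i < 16; i++) {     // 16 consumers per server
>>>>>                 final Channel channel = connection.createChannel();
>>>>>                 channel.basicQos(100);         // prefetch of 100 per channel
>>>>>                 channel.basicConsume("user", false,
>>>>>                         new DefaultConsumer(channel) {
>>>>>                     @Override
>>>>>                     public void handleDelivery(String consumerTag,
>>>>>                                                Envelope envelope,
>>>>>                                                AMQP.BasicProperties properties,
>>>>>                                                byte[] body) throws IOException {
>>>>>                         // ... hand the message to the application ...
>>>>>                         channel.basicAck(envelope.getDeliveryTag(), false);
>>>>>                     }
>>>>>                 });
>>>>>             }
>>>>>         }
>>>>>     }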
>>>>>
>>>>>
>>>>> Each AppServer connects to a sharded MongoDB cluster of 3 shards. Our
>>>>> first suspicion was that something in Mongo or AWS was causing the periodic
>>>>> delay, but AWS techs looked into our volume use and said we were only using
>>>>> 25% of available bandwidth.
>>>>>
>>>>> At this moment, we have a modest publish rate (~50-60/s) but a backlog
>>>>> of 50,000 messages for the queue "user". You can see the cycling in a
>>>>> 10-minute snapshot of the queue.
>>>>>
>>>>> I turned on tracing but the results don't seem to be coming into the
>>>>> log. Is there another way to enable reporting of flow control?
>>>>>
>>>>> Mike Templeman
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Dec 13, 2013 at 6:03 AM, Alvaro Videla wrote:
>>>>>
>>>>>> Mike,
>>>>>>
>>>>>> Would you be able to provide more information to help us debug the
>>>>>> problem?
>>>>>>
>>>>>> Tim (from the rabbitmq team) requested more info in order to try to
>>>>>> find answers for this.
>>>>>>
>>>>>> For example, when consumption drops to zero, are there any logs on the
>>>>>> rabbitmq server that might tell of a flow control mechanism being
>>>>>> activated?
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Alvaro
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 13, 2013 at 2:19 AM, MikeTempleman wrote:
>>>>>>
>>>>>> > Tyson
>>>>>> >
>>>>>> > Did you ever find an answer to this question? We are encountering
>>>>>> > virtually the exact same problem.
>>>>>> >
>>>>>> > We have a variable number of servers set up as producers and
>>>>>> > consumers and see our throughput drop to zero on a periodic basis. This
>>>>>> > is most severe when there are a few hundred thousand messages on Rabbit.
>>>>>> >
>>>>>> > Did you just drop Rabbit? Ours is running on an m1.large instance
>>>>>> > with RAID0 ephemeral drives, so the size and performance of the disk
>>>>>> > subsystem are not an issue (we are still in beta). We have spent untold
>>>>>> > hours tuning our sharded MongoDB subsystem only to find out that it is
>>>>>> > only 25% utilized (at least it will be blazing fast if we ever figure
>>>>>> > this out).
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>
>>>>>
>>>>
>>>
>>