[rabbitmq-discuss] queue hangs - timeout

Fri May 20 12:02:36 BST 2011

Hi Thomas,

On 19/05/11 13:35, Thomas Stagl wrote:
> Hello,
>
> we are running a queue in our test environment with different vhosts.
> When we do a long running perf test, we recognise that the responses
> from the queue are timing out and from that moment on, it is not
> possible to connect to the queue again. Neither with the java lib
> (which we use in our application) nor with celery (which I use for
> testing purposes.)
>
> We have recognised that beam.smp is running on 30% CPU time from that
> moment on.
>
> When we trace a connection, we see this with tcpdump:
>
> 12:23:52.700093 IP CLIENT.43248 > RABBIT.amqp: Flags [P.], seq 2421288989:2421289002, ack 2617563671, win 54, options [nop,nop,TS val 2047452 ecr 1377663677], length 13
> 12:23:52.700124 IP RABBIT.amqp > CLIENT.43248: Flags [.], ack 13, win 62, options [nop,nop,TS val 1377881158 ecr 2047452], length 0
>
> From that moment on, the connection is hanging.
>
> After a restart of the rabbitmq server, everything is running fine
> again, for a certain amount of time.
>
> We also have a second test stack and we switched to the second
> rabbitmq test server, same behaviour here. It worked fine and after a
> couple of hours it stopped responding.
>
> Any ideas? Any thoughts about where we can start debugging would be
> very much appreciated.

We have brokers that have been running for weeks without exhibiting the 
behaviour you describe. Can you help us to emulate your test environment 
more closely?

What versions of Erlang, OS and rabbit are you using?
What are the contents of the rabbit config file?
What exactly is the perf test doing? What are the rate and size of
messages and the number of producers and consumers, and are the
messages persistent?
If you have the management plugin installed then a copy of the broker 
configuration may be useful.

Does rabbitmqctl still work when the problem occurs? The output of all 
the list_* commands at the time of failure will be useful.
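
A minimal sketch of one way to capture that output in a single pass,
assuming Python 3.7+ on the broker host and rabbitmqctl on the PATH. The
helper names (build_commands, collect), the choice of list_* subcommands
and the 30 second timeout are illustrative assumptions, not part of
rabbit; check the subcommands against your rabbitmqctl version. The
timeout is there so that a hung rabbitmqctl gets recorded instead of
blocking the script.

```python
import subprocess

# Assumed split: these take no vhost argument...
GLOBAL_COMMANDS = ["list_vhosts", "list_users",
                   "list_connections", "list_channels"]
# ...and these are run once per vhost with "-p <vhost>".
PER_VHOST_COMMANDS = ["list_queues", "list_exchanges", "list_bindings"]


def build_commands(vhosts, binary="rabbitmqctl"):
    """Return the argv lists to run: global commands, then per-vhost ones."""
    commands = [[binary, name] for name in GLOBAL_COMMANDS]
    for vhost in vhosts:
        commands += [[binary, name, "-p", vhost]
                     for name in PER_VHOST_COMMANDS]
    return commands


def collect(commands, timeout=30):
    """Run each command and map its command line to output or a failure note."""
    results = {}
    for argv in commands:
        key = " ".join(argv)
        try:
            proc = subprocess.run(argv, capture_output=True,
                                  text=True, timeout=timeout)
            results[key] = proc.stdout + proc.stderr
        except subprocess.TimeoutExpired:
            # A hang is useful information in its own right.
            results[key] = "TIMED OUT after %s seconds" % timeout
        except OSError as exc:
            # For example, rabbitmqctl is not installed or not on the PATH.
            results[key] = "FAILED TO RUN: %s" % exc
    return results
```

For example, collect(build_commands(["/"])) returns a dict keyed by
command line, which can be printed or written to a file and attached to
a reply. Pass your own vhost names in place of "/".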

The broker logfile entries around the onset of the problem will also help.

Were there any other applications running on the same OS as rabbit?

Were the consumers (if any) keeping up with producers at or just before 
the onset of the problem?

What was the memory consumption reported by the management plugin at 
this time?

If you can help us to narrow this down to a specific set of conditions
that reliably trigger the problem, that will be a big step towards a
solution.

Thanks

Emile