[rabbitmq-discuss] Erlang crashes reports

Fri Sep 17 10:11:38 BST 2010

Hi Romary,

On 16/09/10 16:38, romary.kremer at gmail.com wrote:
> 
> On 16 Sept. 10 at 16:27, Emile Joubert wrote:
> 
>>
>> Hi Romary,
>>
>> On 16/09/10 14:54, romary.kremer at gmail.com wrote:
>>> Well, I've done another run with the whole upgraded configuration:
>>> - RabbitMQ version 2.1.0
>>> - Erlang R14B (version 5.8.1)
>>> - No SSL connections (but SSL listener still active on the broker, not
>>> used)
>>>
>>> The vm_memory_high_watermark is set to 0.8 (80%), equivalent to 1609 MB!
>>
>> That is too high. It is not advised to set the highwater mark above 50%.
>> Can you reproduce the problem while the vm_memory_high_watermark is set
>> to 0.4 ?
> 
> Since we started investigating RabbitMQ (with version 1.7.2), we have
> never been able to successfully open 10 000 SSL connections with a
> watermark lower than 0.8 (1609 MB)!

If you need to set the highwater mark above 50% then it means your
server should have more RAM. Are you able to execute the test on a
server with 4 GB or more?
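
As a rough check on the figures quoted in this thread (a watermark of 0.8
corresponding to 1609 MB), the Python sketch below estimates the total RAM
implied by that threshold and the average memory budget per connection for
10 000 connections. It assumes the threshold is simply a fraction of total
RAM and that every byte below it goes to connections, which overstates the
real budget. The function names are illustrative only.

```python
# Back-of-the-envelope memory budget, using the figures from this thread.
WATERMARK_MB = 1609        # threshold reported for a watermark of 0.8
WATERMARK_FRACTION = 0.8
CONNECTIONS = 10_000


def total_ram_mb(threshold_mb, fraction):
    """Total RAM implied by a threshold, assuming threshold = fraction * RAM."""
    return threshold_mb / fraction


def per_connection_kb(total_mb, fraction, connections):
    """Average budget per connection (KB) if all memory went to connections."""
    return total_mb * fraction * 1024 / connections


ram = total_ram_mb(WATERMARK_MB, WATERMARK_FRACTION)
for fraction in (0.4, 0.8):
    budget = per_connection_kb(ram, fraction, CONNECTIONS)
    print(f"watermark {fraction}: ~{budget:.1f} KB per connection")
```

Under these assumptions the machine has about 2 GB of RAM, and the budget
is roughly 82 KB per connection at 0.4 and 165 KB at 0.8.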

> The fact is that with 1.7.2 and 1.8.x we successfully managed to accept
> 10 000 connections, without SSL, with the default watermark of 0.4.
> Let me go back and present the history:
>     - We started with the default settings of RabbitMQ 1.7.2, with a version
> that does not use SSL. It worked fine.
> 
>     - Then our requirements evolved, and we decided that peers must
> authenticate the broker upon connection, so we started to investigate
> the use of SSL.
>     
>     - From that point, we could not manage a successful run unless we
> increased the watermark to 0.8.
> 
>     - Now we have upgraded to version 2.x.x of RabbitMQ and we realize that
> the application that used to work on 1.8.0 with SSL no longer works on
> 2.0.0 or on 2.1.0.
> 
>     - That's why we decided to run the same test with SSL
> authentication disabled, and the same issue happens again on both 2.0.0 and 2.1.0.
> 
>>
>>> I've monitored with rabbit_status, even though someone had advised
>>> against it while benchmarking (why not?)
>>
>> rabbitmqctl or rabbit_status are invaluable for debugging the kind of
>> problem you are reporting. Continuous use incurs a small performance
>> penalty and can prevent quiescent resources from hibernating.
> 
> Then what would you recommend for monitoring the broker a bit less
> intrusively?

I *would* recommend using rabbitmqctl.

> Actually, in our environment, we have set up a set of tools to monitor
> the CPU load and the overall memory of the RabbitMQ Erlang process. For
> that we use the built-in UNIX tools
> such as sar, netstat, and ps with the option to show the memory occupied.
> 
> We also monitor the queue depth by invoking rabbitmqctl list_queues,
> but the problem also occurs with that monitoring turned OFF!!
> 
> After a test, we are able to build graphs displaying the
> different statistics monitored.
> If you want, I can send you extracts of these reports so you will have
> a better overview of how the broker behaves during tests.
> 
> Just to have a look, I attach the following snapshots:
> 
> - S0_withSSL_withQueueMonitoring : shows the broker behaviour while
> trying to connect 10 000 peers, with SSL
> 
> - S0_noSSL_withQueueMonitoring : shows the broker behaviour while trying
> to connect 10 000 peers without SSL
> 
> - S0_noSSL_noQueueMonitoring : shows the broker behaviour while trying
> to connect 10 000 peers without SSL, and without periodic calls to
> rabbitmqctl list_queues

These graphs show that you create the connections at a fast rate. Do you
get the same failure if you create connections at a lower rate?
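
To make "a lower rate" concrete, here is a minimal Python sketch of a
pacing loop for a test harness. It is not taken from your client code:
`open_paced` and its `connect` argument are hypothetical names, and
`connect` stands for whatever your harness does to open one peer
connection.

```python
import time


def open_paced(connect, count, per_second,
               sleep=time.sleep, clock=time.monotonic):
    """Call connect(i) for i in range(count), at most per_second per second.

    `connect` is a hypothetical callable supplied by the test harness; it
    should open one client connection and return a handle to it.
    """
    if per_second <= 0:
        raise ValueError("per_second must be positive")
    interval = 1.0 / per_second
    handles = []
    next_start = clock()
    for i in range(count):
        # Wait until the scheduled start time of this attempt.
        delay = next_start - clock()
        if delay > 0:
            sleep(delay)
        handles.append(connect(i))
        # Schedule against the ideal timeline so slow connects do not
        # accumulate extra delay.
        next_start += interval
    return handles
```

Running the same 10 000-peer test at a few different values of
`per_second` would show whether the failure depends on the connection
rate or only on the total number of connections.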

>>> The memory used reached the threshold before all 10 000 peers had
>>> connected.
>>>
>>> Maybe the average memory occupied by a single, non-SSL connection has
>>> somehow got bigger between
>>> release 1.8.x and 2.x.x??
>>
>> The data structures have changed from 1.8.1 to 2.0.0, but I don't think
>> that is the cause of the problem.
>>
>>> Has anybody experimented with, or does anybody know, the impact of the
>>> new release on the memory occupied by connections?

It is possible that more memory is required per connection. If using a
large number of connections caused your RAM budget to be approached in
versions prior to 2.0.0 then you may now exceed it.

>>> I insist on the fact that, in our environment, we can toggle SSL
>>> authentication of the broker by peers, but we always
>>> keep the SSL listener running on the broker. The peer just "decides" to
>>> connect on either 5672 or 5671. In the latter case,
>>> SSL authentication will be enabled; in the former, it won't!
>>>
>>> Thanks for any other idea we can follow, because we are facing a bit of
>>> a dead end since we upgraded to 2.x.x !
>>
>> I'm surprised that the logfile you posted previously stops after opening
>> ssl and tcp listeners. I would expect to see evidence of 4000+ clients
>> connecting.
> 
> Yes, sorry for that. It does indeed contain those lines for connections
> accepted / started, but nothing interesting after that. Basically we jump
> from a series of connections started
> to a series of connections stopped, without any other warning or error,
> until the end of the log file.

That is very strange. I would expect to see evidence of the server
running out of memory.

>> Are you able to get output from "rabbitmqctl list_connections" before
>> the broker becomes unresponsive? What is the status of the beam or
>> beam.smp process at that point - is it still running? How much memory
>> and CPU is it consuming?
> 
> That information can be gathered by our own "monitoring framework" as
> explained above. But since, as you said just before, rabbitmqctl may be
> too intrusive, we use netstat to list the
> established connections on port 5672 (5671 if SSL is turned ON).

rabbitmqctl does incur a small performance penalty if run continuously
and it may then prevent unused resources from hibernating. However it
was designed precisely to investigate the kind of problem you are
experiencing.

Can you confirm whether the beam or beam.smp process is running after
the test failed?
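
One possible way to check, sketched in Python on the assumption that your
system's ps accepts `-eo pid,rss,comm` (as procps on Linux does). The
helper names are illustrative and not part of RabbitMQ.

```python
import os
import subprocess


def find_beam(ps_output):
    """Return (pid, rss_kb, name) for each beam or beam.smp process.

    Expects the text printed by `ps -eo pid,rss,comm`, header line included.
    """
    found = []
    for line in ps_output.splitlines()[1:]:
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue
        pid, rss, comm = fields
        # Some systems print a full path in the command column.
        name = os.path.basename(comm.strip())
        if name in ("beam", "beam.smp"):
            found.append((int(pid), int(rss), name))
    return found


def running_beam():
    """Run ps and return any Erlang VM processes it lists."""
    result = subprocess.run(["ps", "-eo", "pid,rss,comm"],
                            capture_output=True, text=True, check=True)
    return find_beam(result.stdout)
```

An empty list from `running_beam()` after the failure would suggest the
Erlang VM has exited. Otherwise the RSS column gives a rough idea of how
much memory it holds.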

Regards

Emile