[rabbitmq-discuss] Erlang crashes reports

Tue Sep 21 08:03:18 BST 2010

Hi Emile, thanks for the support and the suggestions.

On 17 Sept. 2010 at 11:11, Emile Joubert wrote:

>
> Hi Romary,
>
> On 16/09/10 16:38, romary.kremer at gmail.com wrote:
>>
>> On 16 Sept. 2010 at 16:27, Emile Joubert wrote:
>>
>>>
>>> Hi Romary,
>>>
>>> On 16/09/10 14:54, romary.kremer at gmail.com wrote:
>>>> Well, I've done another run with the whole upgraded configuration:
>>>> - RabbitMQ version 2.1.0
>>>> - Erlang R14B (version 5.8.1)
>>>> - No SSL connections (the SSL listener is still active on the broker,
>>>>   but not used)
>>>>
>>>> The vm_memory_high_watermark is set to 0.8 (80%), equivalent to 1609 MB!
>>>
>>> That is too high. It is not advised to set the high watermark above 50%.
>>> Can you reproduce the problem while the vm_memory_high_watermark is set
>>> to 0.4?
>>
>> Since we started investigating RabbitMQ (with version 1.7.2), we have
>> never been able to successfully open 10 000 SSL connections with a
>> watermark lower than 0.8 (1609 MB)!
>
> If you need to set the high watermark above 50% then it means your
> server should have more RAM. Are you able to execute the test on a
> server with 4 GB or more?
>
>> The fact is that with 1.7.2 and 1.8.x we successfully managed to accept
>> 10 000 connections, without SSL, with the default watermark of 0.4.
>> Let me go back in time to present the history:
>>    - We started with the default settings of RabbitMQ 1.7.2, with a
>>      version that does not use SSL. It worked fine.
>>
>>    - Then our requirements evolved, and we decided that peers must
>>      authenticate the broker upon connection, so we started to
>>      investigate the use of SSL.
>>
>>    - From that point, we could not manage a successful run unless we
>>      increased the watermark to 0.8.
>>
>>    - Now we have upgraded to version 2.x.x of RabbitMQ and we realize
>>      that the application that used to work on 1.8.0 with SSL no longer
>>      works on 2.0.0, nor on 2.1.0.
>>
>>    - That's why we decided to run the same test with SSL authentication
>>      disabled, and the same issue happens again on both 2.0.0 and 2.1.0.
>>
>>>
>>>> I've monitored with rabbit_status, even though someone had advised
>>>> not to while benchmarking (why not?)
>>>
>>> rabbitmqctl or rabbit_status are invaluable for debugging the kind of
>>> problem you are reporting. Continuous use incurs a small performance
>>> penalty and can prevent quiescent resources from hibernating.
>>
>> Then what would you recommend for monitoring the broker a bit less
>> intrusively?
>
> I *would* recommend using rabbitmqctl.
>
>> Actually, in our environment, we have set up a set of tools to monitor
>> the CPU load and the overall memory of the RabbitMQ Erlang process. For
>> that we use the UNIX built-in tools such as sar, netstat, and ps with
>> options to see the memory occupied.
>>
>> We also monitor the queue depth by invoking rabbitmqctl list_queues,
>> but the problem also occurs with that monitoring turned OFF!
>>
>> After a test, we are able to build some graphs displaying the
>> different statistics monitored.
>> If you want, I can send you extracts of these reports so you will have a
>> better overview of how the broker behaves during tests.
>>
>> Just to give you an idea, I attach the following snapshots:
>>
>> - S0_withSSL_withQueueMonitoring: shows the broker behaviour while
>> trying to connect 10 000 peers, with SSL
>>
>> - S0_noSSL_withQueueMonitoring: shows the broker behaviour while trying
>> to connect 10 000 peers without SSL
>>
>> - S0_noSSL_noQueueMonitoring: shows the broker behaviour while trying
>> to connect 10 000 peers without SSL, and without periodic calls to
>> rabbitmqctl list_queues
>
> These graphs show that you create the connections at a fast rate. Do you
> get the same failure if you create connections at a lower rate?

We have a ramp-up scenario to evaluate the kind of thing you're
thinking of. The same crash occurs in the same way, at about 4000
established connections.

>
>>>> Memory usage reached the threshold before all 10 000 peers had
>>>> connected.
>>>>
>>>> Maybe the average memory occupied by a single, non-SSL connection has
>>>> somehow got bigger between releases 1.8.x and 2.x.x?
>>>
>>> The data structures have changed from 1.8.1 to 2.0.0, but I don't think
>>> that is the cause of the problem.
>>>
>>>> Does anybody have experience with, or know, the impact of the new
>>>> release on the memory occupied by connections?
>
> It is possible that more memory is required per connection. If using a
> large number of connections caused your RAM budget to be approached in
> versions prior to 2.0.0 then you may now exceed it.
>
>>>> I insist on the fact that, in our environment, we can toggle SSL
>>>> authentication of the broker by peers, but we always
>>>> keep the SSL listener running on the broker. The peer just "decides"
>>>> to connect on either 5672 or 5671. In the latter case,
>>>> SSL authentication will be enabled; in the former, it won't!
>>>>
>>>> Thanks for any other ideas we can follow, because we have been facing
>>>> a bit of a dead end since we upgraded to 2.x.x!
>>>
>>> I'm surprised that the logfile you posted previously stops after opening
>>> ssl and tcp listeners. I would expect to see evidence of 4000+ clients
>>> connecting.
>>
>> Yes, sorry for that. It does indeed contain those lines for connections
>> accepted / started, but nothing interesting after that. Basically we
>> jump from a series of "connection started" entries
>> to a series of "connection stopped" entries, without any other warning
>> or error, until the end of the log file.
>
> That is very strange. I would expect to see evidence of the server
> running out of memory.
>
>>> Are you able to get output from "rabbitmqctl list_connections" before
>>> the broker becomes unresponsive? What is the status of the beam or
>>> beam.smp process at that point - is it still running? How much memory
>>> and CPU is it consuming?
>>
>> That information can be gathered by our own "monitoring framework", as
>> explained above. But since, as you said just before, rabbitmqctl may be
>> too intrusive, we use netstat to list the
>> established connections on port 5672 (5671 if SSL is turned ON).
>
> rabbitmqctl does incur a small performance penalty if run continuously
> and it may then prevent unused resources from hibernating. However it
> was designed precisely to investigate the kind of problem you are
> experiencing.

I have double-checked that the same issue happens, even with no
monitoring with rabbitmqctl.
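
As an illustration of the netstat-based connection count described in
this thread, here is a minimal Python sketch. It assumes Linux-style
`netstat -tn` output, where the fourth column is the local address in
host:port form and the sixth column is the TCP state. The function names
`count_established` and `sample_broker_connections` are hypothetical; they
are not part of RabbitMQ or of the monitoring framework mentioned above.

```python
import subprocess


def count_established(netstat_output: str, port: int) -> int:
    """Count ESTABLISHED TCP connections whose local address uses `port`.

    Assumes rows shaped like Linux `netstat -tn` output:
    proto recv-q send-q local-address foreign-address state
    """
    suffix = f":{port}"
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the banner and header rows, and anything that is not TCP.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[3].endswith(suffix) and fields[5] == "ESTABLISHED":
            count += 1
    return count


def sample_broker_connections(port: int = 5672) -> int:
    """Run `netstat -tn` once and count connections on the broker port.

    5672 is the plain listener in this thread; pass 5671 for the SSL one.
    """
    result = subprocess.run(
        ["netstat", "-tn"], capture_output=True, text=True, check=True
    )
    return count_established(result.stdout, port)
```

Calling `sample_broker_connections()` at a fixed interval on the broker
host gives a connection count that does not go through rabbitmqctl. On
systems where netstat prints addresses differently, the column indices and
the port suffix check would need adjusting.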

>
> Can you confirm whether the beam or beam.smp process is running after
> the test failed?
>
>
>
> Regards
>
> Emile

We are a bit worried that all the suggestions we get to fix this issue
involve giving the broker more memory, while we thought the purpose
of 2.x.x was to allow the broker to rely more on disk. Moreover, we are
wondering what has caused a connection to become so greedy that we can
no longer establish 10 000 of them on the same
configuration.
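
To put rough numbers on this, here is a minimal Python sketch of the
arithmetic. The only inputs taken from this thread are the 1609 MB
threshold at a watermark of 0.8, the default watermark of 0.4, and the
target of 10 000 connections. The function names are hypothetical, and the
sketch assumes, as a simplification, that the whole budget below the
watermark is available to connections, which is certainly optimistic.

```python
# Figures quoted in this thread.
WATERMARK_MB_AT_0_8 = 1609   # threshold reported for a watermark of 0.8
TARGET_CONNECTIONS = 10_000  # number of peers we try to connect


def total_ram_mb(threshold_mb: float, fraction: float) -> float:
    """Total RAM implied by a threshold in MB at a given watermark fraction."""
    return threshold_mb / fraction


def per_connection_budget_kb(
    total_mb: float, fraction: float, connections: int
) -> float:
    """Average KB per connection if the whole budget went to connections.

    Simplifying assumption: queues, the Erlang runtime and everything else
    use no memory, so this is an upper bound on the per-connection budget.
    """
    return total_mb * fraction * 1024 / connections


total = total_ram_mb(WATERMARK_MB_AT_0_8, 0.8)
for fraction in (0.4, 0.8):
    budget = per_connection_budget_kb(total, fraction, TARGET_CONNECTIONS)
    print(
        f"watermark {fraction}: {total * fraction:.0f} MB threshold, "
        f"about {budget:.0f} KB per connection"
    )
```

Under these assumptions the server has about 2011 MB of RAM, and halving
the watermark from 0.8 to 0.4 halves the average memory each of the 10 000
connections may use before the threshold is reached.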

Since we do not have a 4 GB server available to run the same tests
and we have quite a short deadline to set up a field test, we are
considering going back to RabbitMQ release 1.8.1
with Erlang R14B, hoping that this will be the winning combination.

Best regards,

Romary.