[rabbitmq-discuss] Preventing uncontrolled crashes due to file descriptor exhaustion

Mon Jul 12 20:01:28 BST 2010

Hello RabbitMQ users,

While load testing my RabbitMQ-based RPC mechanism, I managed to get my
RabbitMQ server into some very interesting states when it had a large number
of connections. When the AMQP server exceeded ~1020 active connections, it
would become unstable and eventually crash in a way that lost some data
(including a few persistent messages that had been posted to durable
queues).

The cause of the crash wasn't hard to discover: the parent process of
RabbitMQ was restricted to 1,024 open file handles (this is apparently the
default for the Linux distro I was running), and the resource limit is
inherited by child processes. Simply adding a "ulimit -n 65535" to the
RabbitMQ init script and a "+P 131072" to the Erlang VM command-line gave
the server enough file handles and Erlang processes to handle the load.
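To see the inheritance described above on a given machine, here is a minimal Python sketch. It is not part of the RabbitMQ setup, and it assumes a Linux host; the helper names and the value 256 are purely illustrative. It lowers the soft RLIMIT_NOFILE of the current process, starts a child, and has the child report the limit it received.

```python
import resource
import subprocess
import sys

# One-liner run by the child: print the soft RLIMIT_NOFILE it was started with.
CHILD_CODE = "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"


def child_soft_limit():
    """Start a child Python process and return the soft fd limit it reports."""
    out = subprocess.check_output([sys.executable, "-c", CHILD_CODE])
    return int(out.decode().strip())


def demo_inheritance(lowered=256):
    """Temporarily lower our soft limit and show that a child inherits it.

    `lowered` is an arbitrary illustrative value, not a recommendation.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Never try to raise the limit here; only lower it (or keep it as is).
    if soft == resource.RLIM_INFINITY or soft > lowered:
        target = lowered
    else:
        target = soft
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    try:
        return target, child_soft_limit()
    finally:
        # Restore the original soft limit for the rest of this process.
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))


if __name__ == "__main__":
    parent, child = demo_inheritance()
    print("parent soft limit:", parent, "child soft limit:", child)
```

The same mechanism is why a "ulimit -n" in the init script takes effect for the Erlang VM: the limit is set in the shell and passed down to whatever the shell starts.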

What piqued my interest, however, was the catastrophic and data-lossy way in
which the server crashed when it reached its limit. Normally, RabbitMQ is
very good about avoiding data loss even when it crashes!

Some scrutiny of the logs yielded the following explanation: exhausting the
file handles available to the process prevents various fault-tolerance and
process control mechanisms from working, including:
-- the "cpu_sup" port program that the VM communicates with via a port
-- writes to the Mnesia tables that hold persisted queue contents
-- the "rabbitmqctl" process

The result of all these failures taken together is that the server decides
to shut down but can't shut down cleanly, leading to data loss.
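The "emfile" atom in the logs is the Erlang rendering of the POSIX EMFILE error ("too many open files"). As a rough illustration of the failure mode, and not a reproduction of what RabbitMQ does internally, the following Python sketch lowers the soft descriptor limit and opens /dev/null until the OS refuses. The function name and the cap of 64 are made up for the example; a Unix-like host is assumed.

```python
import errno
import os
import resource


def exhaust_descriptors(soft_cap=64):
    """Open descriptors until the OS refuses; return (count_opened, errno).

    `soft_cap` is an arbitrary small limit chosen so the loop ends quickly.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_cap, hard))
    opened = []
    try:
        while True:
            # Each successful open consumes one descriptor.
            opened.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        # Once the limit is hit, every further open fails the same way,
        # whether it is for a socket, a log file, or a pipe to a helper.
        return len(opened), exc.errno
    finally:
        for fd in opened:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))


if __name__ == "__main__":
    count, err = exhaust_descriptors()
    print("opened", count, "descriptors before", errno.errorcode[err])
```

The point is that the limit is per process and does not care what the descriptor is for, so client sockets can starve the descriptors that logging and persistence need.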

My ultimate solution is rather crude, but workable: using the iptables
conntrack module, I will limit the number of inbound TCP connections to the
server and ensure that the server has enough free file handles to take care
of "housekeeping" operations.
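The arithmetic behind such a cap can be sketched as follows. This is a hedged illustration only: the reserve of 124 descriptors is a made-up number, the function names are hypothetical, and the rule shown uses the iptables "connlimit" match (which relies on connection tracking) as one assumed way to express the limit, which may differ from the exact rule used here. The sketch only builds the command; it does not run it.

```python
def connection_cap(fd_limit, housekeeping_reserve):
    """Return how many client connections fit under the descriptor limit.

    Assumes each inbound AMQP connection costs one descriptor and that
    `housekeeping_reserve` descriptors are held back for Mnesia files,
    port programs, rabbitmqctl, and similar. Both numbers are inputs the
    operator has to choose; nothing here measures them.
    """
    cap = fd_limit - housekeeping_reserve
    if cap <= 0:
        raise ValueError("reserve leaves no descriptors for connections")
    return cap


def connlimit_rule(port, cap):
    """Build (but do not execute) an iptables command as an argument list.

    Assumption: new TCP connections to `port` beyond `cap` concurrent ones
    are rejected with a TCP reset. "--connlimit-mask 0" counts all source
    addresses together rather than per client.
    """
    return [
        "iptables", "-A", "INPUT", "-p", "tcp", "--syn",
        "--dport", str(port),
        "-m", "connlimit",
        "--connlimit-above", str(cap),
        "--connlimit-mask", "0",
        "-j", "REJECT", "--reject-with", "tcp-reset",
    ]


if __name__ == "__main__":
    # 1024 is the default limit mentioned above; 5672 is the standard AMQP port.
    cap = connection_cap(fd_limit=1024, housekeeping_reserve=124)
    print(" ".join(connlimit_rule(5672, cap)))
```

With the descriptor limit raised to 65535 the same calculation applies; only the inputs change.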

I thought I'd share my results with the group in case anyone else has
encountered this problem, and also query whether anyone else has come up
with a different/better solution. Has anyone run into this yet?

Cheers,
    Tony

P.S. Here are some log excerpts from a RabbitMQ server under heavy load that
exemplify the problem:

=ERROR REPORT==== 8-Jul-2010::19:23:13 ===
Error in process <0.121.0> on node 'rabbit@TonyS' with exit value:
{{badmatch,{error,emfile}},[{cpu_sup,get_uint32_measurement,2},{cpu_sup,measurement_server_loop,1}]}

=ERROR REPORT==== 8-Jul-2010::19:23:18 ===
Error in process <0.5307.0> on node 'rabbit@TonyS' with exit value:
{emfile,[{erlang,open_port,[{spawn,"/usr/lib/erlang/lib/os_mon-2.2.5/priv/bin/cpu_sup"},[stream]]},{cpu_sup,start_portprogram,0},{cpu_sup,port_server_init,1}]}

=ERROR REPORT==== 8-Jul-2010::19:24:14 ===
Mnesia(rabbit@TonyS): ** ERROR ** (could not write core file: emfile)
 ** FATAL ** Cannot open log file "/var/lib/rabbitmq/mnesia/rabbit/rabbit_durable_queue.DCL": {file_error,
                  "/var/lib/rabbitmq/mnesia/rabbit/rabbit_durable_queue.DCL",
                  emfile}

=INFO REPORT==== 8-Jul-2010::19:24:14 ===
    application: mnesia
    exited: shutdown
    type: permanent

*-------^ someone decides to shut everything down; tries to do a clean
shutdown but can't persist state  ^-------*

=ERROR REPORT==== 8-Jul-2010::19:24:14 ===
** gen_event handler rabbit_error_logger crashed.
** Was installed in error_logger
** Last event was: {error,<0.39.0>,
                       {<0.42.0>,
                        "Mnesia(~p): ** ERROR ** (could not write core file: ~p)~n ** FATAL ** Cannot open log file ~p: ~p~n",
                        [rabbit@TonyS,emfile,
                         "/var/lib/rabbitmq/mnesia/rabbit/rabbit_durable_queue.DCL",
                         {file_error,
                             "/var/lib/rabbitmq/mnesia/rabbit/rabbit_durable_queue.DCL",
                             emfile}]}}

*-------^ can't even write a crash dump ^-------*