[rabbitmq-discuss] Preventing uncontrolled crashes due to file descriptor exhaustion

Marek Majkowski majek04 at gmail.com
Tue Jul 13 11:24:09 BST 2010


On Mon, Jul 12, 2010 at 20:01, Tony Spataro <tony at rightscale.com> wrote:
> While load testing my RabbitMQ-based RPC mechanism, I managed to get my
> RabbitMQ server into some very interesting states when it had a large number
> of connections. When the AMQP server exceeded ~1020 active connections, it
> would become unstable and eventually crash in a way that lost some data
> (including a few persistent messages that had been posted to durable
> queues).

Well, don't do that. File descriptors are a finite resource, and bad things
can happen when you get close to the limit. The same applies
to memory and disk space.
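
For reference, raising the limit before the node starts is usually all it
takes. A minimal sketch, assuming a plain shell init script and that your
rabbitmq-server wrapper honours RABBITMQ_SERVER_ERL_ARGS (both of those are
assumptions -- check your distro's packaging):

    # raise the per-process file descriptor limit before the broker starts
    ulimit -n 65535

    # let the Erlang VM create more lightweight processes, as Tony did with +P;
    # older wrapper scripts may call this SERVER_ERL_ARGS or need the flag
    # edited in directly
    export RABBITMQ_SERVER_ERL_ARGS="+P 131072"

    # then start the broker as usual
    /usr/sbin/rabbitmq-server -detached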

But I agree, Rabbit shouldn't lose data.

In the new persister branch, "bug21673", we've made some tweaks in this area:
we try to make sure that there are always enough file descriptors
reserved for Erlang itself.

Could you repeat your test on that branch, and tell us if it's any better?

Cheers!
  Marek Majkowski

> The cause for the crash wasn't hard to discover: the parent process of
> RabbitMQ was restricted to 1,024 open file handles (this is apparently the
> default for the Linux distro I was running), and the resource limit is
> inherited by child processes. Simply adding a "ulimit -n 65535" to the
> RabbitMQ init script and a "+P 131072" to the Erlang VM command-line gave
> the server enough file handles and Erlang processes to handle the load.
> What piqued my interest, however, was the catastrophic and data-lossy way in
> which the server crashed when it reached its limit. Normally, RabbitMQ is
> very good about avoiding data loss even when it crashes!
> Some scrutiny of the logs yielded the following explanation: exhausting the
> file handles available to the process prevents various fault-tolerance and
> process control mechanisms from working, including:
> -- a "cpu-sup" process that the VM is trying to communicate with via a port
> -- writing the Mnesia tables that hold persisted queue contents
> -- the "rabbitmqctl" process
> The result of all these failures taken together is that the server decides
> to shut down but can't do so cleanly, leading to data loss.
> My ultimate solution is rather crude, but workable: using the iptables
> conntrack module, I will limit the number of inbound TCP connections to the
> server and ensure that the server has enough free file handles to take care
> of "housekeeping" operations.
> I thought I'd share my results with the group in case anyone else has
> encountered this problem, and also query whether anyone else has come up
> with a different/better solution. Has anyone run into this yet?
> Cheers,
>     Tony
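
A footnote on the iptables workaround quoted above: something like the rule
below is one way to express it, using the connlimit match rather than raw
conntrack state tracking (the port, the cap of 900 and the REJECT policy are
illustrative assumptions to adjust for your setup):

    # cap concurrent AMQP connections so the broker keeps some file
    # descriptors free for housekeeping; numbers are examples only
    iptables -A INPUT -p tcp --syn --dport 5672 \
        -m connlimit --connlimit-above 900 --connlimit-mask 0 -j REJECT

With --connlimit-mask 0 all clients are counted against one shared limit;
leave the mask off if you only want to cap connections per source address.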

